'''AVX-512''',<br/>
an expansion of [[Intel|Intel's]] [[AVX]] and [[AVX2]] instructions using the [https://en.wikipedia.org/wiki/EVEX_prefix EVEX prefix], featuring '''32''' 512-bit wide vector [[SIMD and SWAR Techniques|SIMD]] registers zmm0 through zmm31, each holding either eight [[Double|doubles]] or eight integer [[Quad Word|quad words]] such as [[Bitboards|bitboards]], and eight dedicated mask registers k0 through k7 (only seven usable as write mask, since encoding k0 means unmasked operation) which specify which vector elements are operated on and written. If the Nth bit of a vector mask register is set, the Nth element of the destination vector is overwritten with the result of the operation; otherwise, depending on the instruction, the element is either zeroed, or taken from another source register (it remains unchanged if that source is the destination). A vector mask register can be set by vector compare instructions, by instructions moving contents from a GP register, or by a special subset of vector mask arithmetic instructions.
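The two masking modes can be sketched in portable C as a scalar model of one 8-lane quad word vector; this is for illustration only, and the names (merge_mask, zero_mask, LANES) are made up, not Intel API:

```c
#include <stdint.h>

enum { LANES = 8 };  /* eight 64-bit lanes of a zmm register */

/* merge masking: lanes whose mask bit is clear keep the value of src */
static void merge_mask(uint64_t dst[LANES], const uint64_t src[LANES],
                       uint8_t k, const uint64_t result[LANES]) {
    for (int i = 0; i < LANES; i++)
        dst[i] = ((k >> i) & 1) ? result[i] : src[i];
}

/* zero masking: lanes whose mask bit is clear are set to zero */
static void zero_mask(uint64_t dst[LANES], uint8_t k,
                      const uint64_t result[LANES]) {
    for (int i = 0; i < LANES; i++)
        dst[i] = ((k >> i) & 1) ? result[i] : 0;
}
```

For example, with mask 0x05 only lanes 0 and 2 receive the computed result; the other lanes are merged from src or zeroed, matching the mask/maskz intrinsic variants listed below.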
  
 
=Extensions=
AVX-512 consists of multiple extensions, not all meant to be supported by all AVX-512 capable processors. Only the core extension AVX-512 F (AVX-512 Foundation) is required by all implementations. AVX-512 F and AVX-512 CD were first implemented in the [https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing Knights Landing] Xeon Phi processor and coprocessor, launched on June 20, 2016.
 
{| class="wikitable"
|-
! Extension
! Description
! Architecture
! CPUID 7<br/>Reg:Bit <ref>[https://www.heise.de/ct/zcontent/17/16-hocmsmeta/1501873687265857/ct.1617.016-017.qxp_table_29578.html AVX512 table] from [https://en.wikipedia.org/wiki/Heinz_Heise Heise]</ref>
|-
|  AVX-512 F
|  Foundation
| rowspan="4" | [https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing Knights Landing]
|  EBX:16
|-
|  AVX-512 CD
|  Conflict Detection Instructions
|  EBX:28
|-
|  AVX-512 ER
|  Exponential and Reciprocal Instructions
|  EBX:27
|-
|  AVX-512 PF
|  Prefetch Instructions
|  EBX:26
|-
|  AVX-512 BW
|  [[Byte]] and [[Word]] Instructions
| rowspan="3" | [https://en.wikipedia.org/wiki/Skylake_(microarchitecture) Skylake X]
|  EBX:30
|-
|  AVX-512 DQ
|  [[Double Word|Doubleword]] and [[Quad Word|Quadword]] Instructions
|  EBX:17
|-
|  AVX-512 VL
|  Vector Length Extensions
|  EBX:31
|-
|  AVX-512 IFMA
|  Integer Fused Multiply Add
| rowspan="2" | [https://en.wikipedia.org/wiki/Cannonlake Cannonlake]
|  EBX:21
|-
|  AVX-512 VBMI
|  Vector Byte Manipulation Instructions
|  ECX:01
|-
|  AVX-512 VPOPCNTDQ
|  Vector [[Population Count]]
| rowspan="3" | [https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Mill Knights Mill]
|  ECX:14
|-
|  AVX-512-4VNNIW
|  Vector Neural Network Instructions<br/>Word variable precision
|  EDX:02
|-
|  AVX-512-4FMAPS
|  Fused Multiply Accumulation<br/>Packed Single precision
|  EDX:03
|-
|  AVX-512 VNNI
|  Vector Neural Network Instructions<br/>Vector Instructions for [[Deep Learning]]
| rowspan="4" | [https://en.wikipedia.org/wiki/Ice_Lake_(microprocessor) Ice Lake]
|  ECX:11
|-
|  AVX-512 VBMI2
|  Vector Byte Manipulation Instructions 2<br/>[[Byte]]/[[Word]] Load, Store and Concatenation with Shift
|
|-
|  AVX-512 BITALG
|  Bit Algorithms<br/>Byte/Word Bit Manipulation Instructions expanding VPOPCNTDQ
|
|-
|  AVX-512 GFNI
|  Galois Field New Instructions<br/>Vector Instructions for calculating [https://en.wikipedia.org/wiki/Finite_field Galois Field] GF(2^8)
|
|}
  
 
=Selected Instructions=
==VPTERNLOG<span id="VPTERNLOG"></span>==
AVX-512 F features the instruction VPTERNLOGQ (or VPTERNLOGD) to perform bitwise [https://en.wikipedia.org/wiki/Ternary_operation ternary logic], for instance to [[General Setwise Operations|operate]] on vectors of [[Bitboards|bitboards]]. Three input vectors are bitwise [[Combinatorial Logic|combined]] by an operation determined by an immediate byte operand ('''imm8'''), whose 256 possible values correspond to the boolean output vectors of the [https://en.wikipedia.org/wiki/Truth_table truth table] for all eight combinations of the three input bits, as demonstrated with some selected imm8 values in the table below <ref>[http://0x80.pl/articles/avx512-ternary-functions.html AVX512: ternary functions evaluation] by [[Wojciech Muła]], March 03, 2015</ref> <ref>[https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf Intel® Architecture Instruction Set Extensions Programming Reference] (pdf) 5.3 TERNARY BIT VECTOR LOGIC TABLE</ref> :
 
{| class="wikitable"
|-
! colspan="4" | Input
! colspan="12" | Output of Operations
|-
! colspan="4" | imm8
! 0x00 !! 0x01 !! 0x16 !! 0x17 !! 0x28 !! 0x80 !! 0x88 !! 0x96 !! 0xca !! 0xe8 !! 0xfe !! 0xff
|-
! # !! a !! b !! c
! false !! <nowiki>~(a|b|c)</nowiki> !! <nowiki>a?~(b|c):b^c</nowiki> !! minor(a,b,c) !! c&(a^b) !! a&b&c !! b&c !! a^b^c !! a?b:c !! major(a,b,c) !! <nowiki>a|b|c</nowiki> !! true
|-
| 0 || 0 || 0 || 0 || 0 || 1 || 0 || 1 || 0 || 0 || 0 || 0 || 0 || 0 || 0 || 1
|-
| 1 || 0 || 0 || 1 || 0 || 0 || 1 || 1 || 0 || 0 || 0 || 1 || 1 || 0 || 1 || 1
|-
| 2 || 0 || 1 || 0 || 0 || 0 || 1 || 1 || 0 || 0 || 0 || 1 || 0 || 0 || 1 || 1
|-
| 3 || 0 || 1 || 1 || 0 || 0 || 0 || 0 || 1 || 0 || 1 || 0 || 1 || 1 || 1 || 1
|-
| 4 || 1 || 0 || 0 || 0 || 0 || 1 || 1 || 0 || 0 || 0 || 1 || 0 || 0 || 1 || 1
|-
| 5 || 1 || 0 || 1 || 0 || 0 || 0 || 0 || 1 || 0 || 0 || 0 || 0 || 1 || 1 || 1
|-
| 6 || 1 || 1 || 0 || 0 || 0 || 0 || 0 || 0 || 0 || 0 || 0 || 1 || 1 || 1 || 1
|-
| 7 || 1 || 1 || 1 || 0 || 0 || 0 || 0 || 0 || 1 || 1 || 1 || 1 || 1 || 1 || 1
|}
 
The following VPTERNLOGQ intrinsics are declared, where the maskz version sets unmasked destination quad word elements to zero, while the mask version copies unmasked elements from src:
 
<pre>
__m256i _mm256_ternarylogic_epi64(__m256i a, __m256i b, __m256i c, int imm8);
__m256i _mm256_maskz_ternarylogic_epi64(__mmask8 k, __m256i a, __m256i b, __m256i c, int imm8);
__m256i _mm256_mask_ternarylogic_epi64(__m256i src, __mmask8 k, __m256i a, __m256i b, int imm8);
__m512i _mm512_ternarylogic_epi64(__m512i a, __m512i b, __m512i c, int imm8);
__m512i _mm512_maskz_ternarylogic_epi64(__mmask8 k, __m512i a, __m512i b, __m512i c, int imm8);
__m512i _mm512_mask_ternarylogic_epi64(__m512i src, __mmask8 k, __m512i a, __m512i b, int imm8);
</pre>
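Since imm8 is just the output column of the truth table, with bit index 4a + 2b + c, it can be derived for any ternary boolean function in portable C; this is an illustrative sketch (the helper ternlog_imm8 and the f_* functions are made-up names), reproducing values such as 0x96 for a^b^c and 0xca for a?b:c:

```c
#include <stdint.h>

typedef int (*ternfn)(int a, int b, int c);

/* Computes the VPTERNLOG imm8 for a ternary boolean function f:
   bit i of imm8 is f(a,b,c) where i = a*4 + b*2 + c. */
static uint8_t ternlog_imm8(ternfn f) {
    uint8_t imm8 = 0;
    for (int i = 0; i < 8; i++) {
        int a = (i >> 2) & 1, b = (i >> 1) & 1, c = i & 1;
        if (f(a, b, c))
            imm8 |= (uint8_t)(1 << i);
    }
    return imm8;
}

static int f_xor3 (int a, int b, int c) { return a ^ b ^ c; }
static int f_major(int a, int b, int c) { return (a & b) | (a & c) | (b & c); }
static int f_sel  (int a, int b, int c) { return a ? b : c; }
```

With an imm8 obtained this way, _mm512_ternarylogic_epi64(x, y, z, 0x96) evaluates x ^ y ^ z on eight bitboards in a single instruction.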
 
<span id="VPLZCNT"></span>
==VPLZCNT==
AVX-512 CD has Vector [[BitScan#LeadingZeroCount|Leading Zero Count]] - VPLZCNTQ counts leading zeroes on a vector of eight bitboards in parallel <ref>[https://www.google.com/patents/US9372692 Patent US9372692 - Methods, apparatus, instructions, and logic to provide permute controls with leading zero count functionality - Google Patent Search]</ref> - using the following intrinsics <ref>[https://hjlebbink.github.io/x86doc/html/VPLZCNTD_Q.html VPLZCNTD/Q—Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values]</ref>, where the maskz version sets unmasked destination elements to zero, while the mask version copies unmasked elements from src:
 
<pre>
__m256i _mm256_lzcnt_epi64(__m256i a);
__m256i _mm256_maskz_lzcnt_epi64(__mmask8 k, __m256i a);
__m256i _mm256_mask_lzcnt_epi64(__m256i src, __mmask8 k, __m256i a);
__m512i _mm512_lzcnt_epi64(__m512i a);
__m512i _mm512_maskz_lzcnt_epi64(__mmask8 k, __m512i a);
__m512i _mm512_mask_lzcnt_epi64(__m512i src, __mmask8 k, __m512i a);
</pre>
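The per-lane semantics can be modeled in portable C (a scalar sketch, lzcnt64 is an illustrative name, not the hardware implementation); note that like the instruction it yields 64 for an empty bitboard, and that 63 - lzcnt64(bb) gives the square index of the most significant one bit:

```c
#include <stdint.h>

/* Scalar model of one VPLZCNTQ lane: number of leading zero bits
   of a 64-bit quad word; yields 64 for input 0. */
static int lzcnt64(uint64_t bb) {
    int n = 0;
    for (uint64_t m = 0x8000000000000000ULL; m != 0 && !(bb & m); m >>= 1)
        n++;  /* walk down from bit 63 until a one bit is found */
    return n;
}
```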
 
<span id="VPOPCNT"></span>
==VPOPCNT==
The AVX-512 VPOPCNTDQ extension has a vector [[Population Count|population count]] instruction to count the one bits of either 16 32-bit double words (VPOPCNTD) or 8 64-bit quad words aka bitboards (VPOPCNTQ) in parallel <ref>[https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx512-harley-seal.cpp sse-popcount/popcnt-avx512-harley-seal.cpp at master · WojciechMula/sse-popcount · GitHub]</ref> <ref>[[Wojciech Muła]], [http://dblp.uni-trier.de/pers/hd/k/Kurz:Nathan Nathan Kurz], [https://github.com/lemire Daniel Lemire] ('''2016'''). ''Faster Population Counts Using AVX2 Instructions''. [https://arxiv.org/abs/1611.07612 arXiv:1611.07612]</ref> <ref>[https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=VPOPCNTD&expand=4368 Intel® Intrinsics Guide VPOPCNTD]</ref>.
<pre>
__m128i _mm_mask_popcnt_epi32(__m128i src, __mmask8 k, __m128i a);
__m128i _mm_maskz_popcnt_epi32(__mmask8 k, __m128i a);
__m128i _mm_popcnt_epi32(__m128i a);
__m256i _mm256_mask_popcnt_epi32(__m256i src, __mmask8 k, __m256i a);
__m256i _mm256_maskz_popcnt_epi32(__mmask8 k, __m256i a);
__m256i _mm256_popcnt_epi32(__m256i a);
__m512i _mm512_mask_popcnt_epi32(__m512i src, __mmask16 k, __m512i a);
__m512i _mm512_maskz_popcnt_epi32(__mmask16 k, __m512i a);
__m512i _mm512_popcnt_epi32(__m512i a);

__m128i _mm_mask_popcnt_epi64(__m128i src, __mmask8 k, __m128i a);
__m128i _mm_maskz_popcnt_epi64(__mmask8 k, __m128i a);
__m128i _mm_popcnt_epi64(__m128i a);
__m256i _mm256_mask_popcnt_epi64(__m256i src, __mmask8 k, __m256i a);
__m256i _mm256_maskz_popcnt_epi64(__mmask8 k, __m256i a);
__m256i _mm256_popcnt_epi64(__m256i a);
__m512i _mm512_mask_popcnt_epi64(__m512i src, __mmask8 k, __m512i a);
__m512i _mm512_maskz_popcnt_epi64(__mmask8 k, __m512i a);
__m512i _mm512_popcnt_epi64(__m512i a);
</pre>
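For reference, one VPOPCNTQ lane behaves like the classic SWAR population count; this portable C sketch (popcount64 is an illustrative name) is a scalar model, not the hardware implementation:

```c
#include <stdint.h>

/* Scalar model of one VPOPCNTQ lane: parallel-prefix (SWAR)
   population count of a 64-bit quad word. */
static int popcount64(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);                   /* 2-bit sums  */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;                   /* byte sums   */
    return (int)((x * 0x0101010101010101ULL) >> 56);              /* horizontal add */
}
```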
==VPDPBUSD==
The AVX-512 VNNI extension features several instructions speeding up [[Neural Networks|neural network]] and [[Deep Learning|deep learning]] calculations on the CPU, for instance [[NNUE]] inference using uint8/int8. VPDPBUSD (Multiply and Add Unsigned and Signed Bytes) <ref>[https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=2168,2201&text=VPDPBUSD&avx512techs=AVX512_VNNI Intel® Intrinsics Guide VPDPBUSD]</ref> executes on both port 0 and port 5 in one cycle <ref>[https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html Lower Numerical Precision Deep Learning Inference and Training] by [https://community.intel.com/t5/user/viewprofilepage/user-id/134067 Andres Rodriguez] et al., January 19, 2018</ref>.
<pre>
__m512i _mm512_dpbusd_epi32 (__m512i src, __m512i a, __m512i b)
{
  for (j = 0; j < 16; j++) {
    tmp1.word := Signed(ZeroExtend16(a.byte[4*j  ]) * SignExtend16(b.byte[4*j  ]));
    tmp2.word := Signed(ZeroExtend16(a.byte[4*j+1]) * SignExtend16(b.byte[4*j+1]));
    tmp3.word := Signed(ZeroExtend16(a.byte[4*j+2]) * SignExtend16(b.byte[4*j+2]));
    tmp4.word := Signed(ZeroExtend16(a.byte[4*j+3]) * SignExtend16(b.byte[4*j+3]));
    dst.dword[j] := src.dword[j] + tmp1 + tmp2 + tmp3 + tmp4;
  }
  return dst;
}
</pre>
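The pseudocode above can be modeled per 32-bit lane in portable C (dpbusd_lane is an illustrative name): the u8 by i8 products always fit in a signed 16-bit word, and the dword accumulation does not saturate (saturation is the separate VPDPBUSDS instruction):

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of VPDPBUSD: multiplies four
   unsigned bytes of a with four signed bytes of b and accumulates
   the four products into the 32-bit value src. */
static int32_t dpbusd_lane(int32_t src, const uint8_t a[4], const int8_t b[4]) {
    int32_t sum = src;
    for (int i = 0; i < 4; i++)
        sum += (int32_t)a[i] * (int32_t)b[i];  /* product fits in 16 bits */
    return sum;
}
```

This per-lane dot product is exactly the shape of an int8 NNUE accumulator update, done for sixteen lanes at once by the instruction.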
 
=See also=
* [[CFish#AVX2 Attacks|CFish - AVX2 Attacks]]
* [[SIMD and SWAR Techniques|SIMD]]
* [[SSE2]]
* [[XOP]]
=Publications=
* [https://os.itec.kit.edu/21_3247.php Mathias Gottschlag], [https://os.itec.kit.edu/21_31.php Frank Bellosa] ('''2018'''). ''[https://os.itec.kit.edu/21_3486.php Mechanism to Mitigate AVX-Induced Frequency Reduction]''. [https://arxiv.org/abs/1901.04982 arXiv:1901.04982]
* [https://os.itec.kit.edu/21_3247.php Mathias Gottschlag], [https://os.itec.kit.edu/97_3742.php Philipp Machauer], [https://os.itec.kit.edu/21_3571.php Yussuf Khalil], [https://os.itec.kit.edu/21_31.php Frank Bellosa] ('''2021'''). ''[https://www.usenix.org/conference/atc21/presentation/gottschlag Fair Scheduling for AVX2 and AVX-512 Workloads]''. [https://www.usenix.org/conference/atc21 USENIX ATC '21]
  
 
=Manuals=
* [https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf Intel® Architecture Instruction Set Extensions Programming Reference] (pdf)
* [https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-optimization-reference-manual.html Intel® 64 and IA-32 Architectures Optimization Reference Manual]
=Forum Posts=
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75049 AVX-512 and NNUE] by [[Gian-Carlo Pascutto]], [[CCC]], September 08, 2020 » [[NNUE]]
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=77246 VPOPCNTDQ and VBMI2] by [[Vivien Clauzon]], [[CCC]], May 04, 2021
  
 
=External Links=
* [https://en.wikichip.org/wiki/x86/avx512_vnni AVX-512 Vector Neural Network Instructions (VNNI) - x86 - WikiChip]
* [https://www.intel.com/content/www/us/en/architecture-and-technology/avx-512-overview.html Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Overview]
* [https://software.intel.com/en-us/intel-isa-extensions Intel Instruction Set Architecture Extensions]
==Blogs==
* [https://software.intel.com/en-us/blogs/2013/avx-512-instructions AVX-512 instructions | Intel® Developer Zone] by [https://software.intel.com/en-us/user/335550 James Reinders], July 23, 2013
* [http://www.agner.org/optimize/blog/read.php?i=288 Future instruction set: AVX-512] by [http://www.agner.org/ Agner Fog], October 09, 2013
* [https://software.intel.com/en-us/blogs/2014/07/24/processing-arrays-of-bits-with-intel-advanced-vector-extensions-512-intel-avx-512 Processing Arrays of Bits with Intel® Advanced Vector Extensions 512 (Intel® AVX-512) | Intel® Developer Zone] by [https://software.intel.com/en-us/user/123920 Thomas Willhalm], July 24, 2014
* [https://www.hpcwire.com/2017/06/29/reinders-avx-512-may-hidden-gem-intel-xeon-scalable-processors/ AVX-512 May Be a “Hidden Gem” in Intel Xeon Scalable Processors] by [https://software.intel.com/en-us/user/335550 James Reinders], [https://www.hpcwire.com/ HPCwire], June 29, 2017
* [https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html Lower Numerical Precision Deep Learning Inference and Training] by [https://community.intel.com/t5/user/viewprofilepage/user-id/134067 Andres Rodriguez] et al., January 19, 2018
 
==Compiler Support==
* [https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX_512 Intel Intrinsics Guide - AVX-512]
 
=References=
<references />

'''[[x86-64|Up one Level]]'''

Latest revision as of 13:42, 17 March 2022