Changes

Jump to: navigation, search

AVX-512

14,472 bytes added, 15:30, 9 August 2018
Created page with "'''Home * Hardware * x86-64 * AVX-512''' '''AVX-512''',<br/> an expansion of Intel's the AVX and AVX2 instructions using the [https://..."
'''[[Main Page|Home]] * [[Hardware]] * [[x86-64]] * AVX-512'''

'''AVX-512''',<br/>
an expansion of [[Intel|Intel's]] the [[AVX]] and [[AVX2]] instructions using the [https://en.wikipedia.org/wiki/EVEX_prefix EVEX prefix], featuring '''32''' 512-bit wide vector [[SIMD and SWAR Techniques|SIMD]] registers zmm0 through zmm31, keeping either eight [[Double|doubles]] or integer [[Quad Word|quad words]] such as [[Bitboards|bitboards]], and eight (seven) dedicated mask registers which specify which vector elements are operated on and written. If the Nth bit of a vector mask register is set, then the Nth element of the destination vector is overridden with the result of the operation; otherwise, dependent of the instruction, the element is zeroed, or overridden by an element from another source register (remains unchanged if same source). A vector mask register can be set using vector compare instructions, instructions to move contents from a GP register, or a special subset of vector mask arithmetic instructions.

=Extensions=
AVX-512 consists of multiple extensions not all meant to be supported by all AVX-512 capable processors. Only the core extension AVX-512F (AVX-512 Foundation) is required by all implementations <ref>[https://en.wikipedia.org/wiki/AVX-512 AVX-512 from Wikipedia]</ref> AVX-512F and AVX-512CD were first implemented in the [https://en.wikipedia.org/wiki/Xeon_Phi Xeon Phi] processor and coprocessor known by the code name [https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing Knights Landing] <ref>[https://software.intel.com/en-us/blogs/additional-avx-512-instructions Additional AVX-512 instructions | Intel® Developer Zone] by [https://software.intel.com/en-us/user/335550 James Reinders], July 17, 2014</ref> , launched on June 20, 2016.

{| class="wikitable"
|-
! Extension
! Description
! Architecture
! [https://en.wikipedia.org/wiki/CPUID#EAX.3D7.2C_ECX.3D0:_Extended_Features CPUID 7]
Reg:Bit <ref>[https://www.heise.de/ct/zcontent/17/16-hocmsmeta/1501873687265857/ct.1617.016-017.qxp_table_29578.html AVX512 table] from [https://en.wikipedia.org/wiki/Heinz_Heise Heise]</ref>
|-
| AVX-512F
| Foundation
| rowspan="4" | [https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing Knights Landing]
| EBX:16
|-
| AVX-512CD
| Conflict Detection Instructions
| EBX:28
|-
| AVX-512ER
| Exponential and Reciprocal Instructions
| EBX:27
|-
| AVX-512PF
| Prefetch Instructions
| EBX:26
|-
| AVX-512BW
| [[Byte]] and [[Word]] Instructions
| rowspan="3" | [https://en.wikipedia.org/wiki/Skylake_(microarchitecture) Skylake X]
| EBX:30
|-
| AVX-512DQ
| [[Double Word|Doubleword]] and [[Quad Word|Quadword]] Instructions
| EBX:17
|-
| AVX-512VL
| Vector Length Extensions
| EBX:31
|-
| AVX-512IFMA
| Integer Fused Multiply Add
| rowspan="2" | [https://en.wikipedia.org/wiki/Cannonlake Cannonlake]
| EBX:21
|-
| AVX-512VBMI
| Vector Byte Manipulation Instructions
| ECX:01
|-
| AVX-512VPOPCNTDQ
| Vector [[Population Count]]
| rowspan="3" | [https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Mill Knights Mill]
| ECX:14
|-
| AVX-512-4VNNIW
| Vector [[Neural Networks|Neural Network]] Instructions<br/>Word variable precision
| EDX:02
|-
| AVX-512-4FMAPS
| Fused Multiply Accumulation<br/>Packed Single precision
| EDX:03
|}

=Selected Instructions=
==VPTERNLOG<span id="VPTERNLOG"></span>==
AVX-512F features the instruction VPTERNLOGQ (or VPTERNLOGD) to perform bitwise [https://en.wikipedia.org/wiki/Ternary_operation ternary logic], for instance to [[General Setwise Operations|operate]] on vectors of [[Bitboards|bitboards]]. Three input vectors are bitwise [[Combinatorial Logic|combined]] by an operation determined by an immediate byte operand ('''imm8'''), whose 256 possible values corresponds with the boolean output vector of the [https://en.wikipedia.org/wiki/Truth_table truth table] for all eight combinations of the three input bits, as demonstrated with some selected imm8 values in the table below <ref>[http://0x80.pl/articles/avx512-ternary-functions.html AVX512: ternary functions evaluation] by [[Wojciech Muła]], March 03, 2015</ref> <ref>[https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf Intel® Architecture Instruction Set Extensions Programming Reference] (pdf) 5.3 TERNARY BIT VECTOR LOGIC TABLE</ref> :
{| class="wikitable"
|-
! colspan="4" | Input
!
! colspan="12" | Output of Operations
|-
! colspan="4" |
! imm8
| style="text-align:center;" | 0x00
| style="text-align:center;" | 0x01
| style="text-align:center;" | 0x16
| style="text-align:center;" | 0x17
| style="text-align:center;" | 0x28
| style="text-align:center;" | 0x80
| style="text-align:center;" | 0x88
| style="text-align:center;" | 0x96
| style="text-align:center;" | 0xca
| style="text-align:center;" | 0xe8
| style="text-align:center;" | 0xfe
| style="text-align:center;" | 0xff
|-
! #
! a
! b
! c
! C-exp
| style="text-align:center;" | false
| style="text-align:center;" | ~(a|b|c)
| style="text-align:center;" | a?~(b|c):b^c
| minor(a,b,c)
| style="text-align:center;" | c&(a^b)
| style="text-align:center;" | a&b&c
| style="text-align:center;" | b&c
| style="text-align:center;" | a^b^c
| style="text-align:center;" | a?b:c
| style="text-align:center;" | [[General Setwise Operations#Majority|major]](a,b,c)
| style="text-align:center;" | a|b|c
| style="text-align:center;" | true
|-
! 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
!
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 1
|-
! 1
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 1
!
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 1
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 1
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 1
|-
! 2
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 0
!
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 1
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 1
|-
! 3
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 1
!
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 1
| style="text-align:center;" | 1
| style="text-align:center;" | 1
|-
! 4
| style="text-align:center;" | 1
| style="text-align:center;" | 0
| style="text-align:center;" | 0
!
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 1
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 1
|-
! 5
| style="text-align:center;" | 1
| style="text-align:center;" | 0
| style="text-align:center;" | 1
!
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 1
| style="text-align:center;" | 1
|-
! 6
| style="text-align:center;" | 1
| style="text-align:center;" | 1
| style="text-align:center;" | 0
!
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 1
| style="text-align:center;" | 1
| style="text-align:center;" | 1
|-
! 7
| style="text-align:center;" | 1
| style="text-align:center;" | 1
| style="text-align:center;" | 1
!
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 0
| style="text-align:center;" | 1
| style="text-align:center;" | 1
| style="text-align:center;" | 1
| style="text-align:center;" | 1
| style="text-align:center;" | 1
| style="text-align:center;" | 1
| style="text-align:center;" | 1
|}
Following VPTERNLOGQ intrinsics are declared, where the maskz version sets unmasked destination quad word elements to zero, while the mask version copies unmasked elements from s:
<pre>
__m512i _mm512_ternarylogic_epi64(__m512i a, __m512i b, __m512i c, int imm8);
__m512i _mm512_maskz_ternarylogic_epi64( __mmask8 m, __m512i a, __m512i b, __m512i c, int imm8);
__m512i _mm512_mask_ternarylogic_epi64(__m512i s, __mmask8 m, __m512i a, __m512i b, __m512i c, int imm8);
</pre>
<span id="VPLZCNT"></span>
==VPLZCNT==
AVX-512CD has Vector [[BitScan#LeadingZeroCount|Leading Zero Count]] - VPLZCNTQ counts leading zeroes on a vector of eight bitboards in parallel <ref>[https://www.google.com/patents/US9372692 Patent US9372692 - Methods, apparatus, instructions, and logic to provide permute controls with leading zero count functionality - Google Patent Search]</ref> - using following intrinsics <ref>[https://hjlebbink.github.io/x86doc/html/VPLZCNTD_Q.html VPLZCNTD/Q—Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values]</ref>, where the maskz version sets unmasked destination elements to zero, while the mask version copies unmasked elements from s:
<pre>
__m512i _mm512_lzcnt_epi64(__m512i a);
__m512i _mm512_maskz_lzcnt_epi64(__mmask8 m, __m512i a);
__m512i _mm512_mask_lzcnt_epi64(__m512i s, __mmask8 m, __m512i a);
</pre>
<span id="VPOPCNT"></span>
==VPOPCNT==
The future AVX-512VPOPCNTDQ extension has a vector [[Population Count|population count]] instruction to count one bits of either 16 32-bit double words (VPOPCNTD) or 8 64-bit quad words aka bitboards (VPOPCNTQ) in parallel <ref>[https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx512-harley-seal.cpp sse-popcount/popcnt-avx512-harley-seal.cpp at master · WojciechMula/sse-popcount · GitHub]</ref> <ref>[[Wojciech Muła]], [http://dblp.uni-trier.de/pers/hd/k/Kurz:Nathan Nathan Kurz], [https://github.com/lemire Daniel Lemire] ('''2016'''). ''Faster Population Counts Using AVX2 Instructions''. [https://arxiv.org/abs/1611.07612 arXiv:1611.07612]</ref>.

=See also=
* [[AltiVec]]
* [[AVX]]
* [[AVX2]]
* [[SIMD and SWAR Techniques]]
* [[SSE2]]
* [[XOP]]

=Manuals=
* [https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf Intel® Architecture Instruction Set Extensions Programming Reference] (pdf)

=External Links=
* [https://en.wikipedia.org/wiki/AVX-512 AVX-512 from Wikipedia]
* [https://www.intel.com/content/www/us/en/architecture-and-technology/avx-512-overview.html Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Overview]
* [https://software.intel.com/en-us/intel-isa-extensions Intel Instruction Set Architecture Extensions]
==Blog Postings==
* [https://software.intel.com/en-us/blogs/2013/avx-512-instructions AVX-512 instructions | Intel® Developer Zone] by [https://software.intel.com/en-us/user/335550 James Reinders], July 23, 2013
* [http://www.agner.org/optimize/blog/read.php?i=288 Future instruction set: AVX-512] by [http://www.agner.org/ Agner Fog], October, 09, 2013
* [https://software.intel.com/en-us/blogs/additional-avx-512-instructions Additional AVX-512 instructions | Intel® Developer Zone] by [https://software.intel.com/en-us/user/335550 James Reinders], July 17, 2014
* [https://software.intel.com/en-us/blogs/2014/07/24/processing-arrays-of-bits-with-intel-advanced-vector-extensions-512-intel-avx-512 Processing Arrays of Bits with Intel® Advanced Vector Extensions 512 (Intel® AVX-512) | Intel® Developer Zone] by [https://software.intel.com/en-us/user/123920 Thomas Willhalm], July 24, 2014
* [https://www.hpcwire.com/2017/06/29/reinders-avx-512-may-hidden-gem-intel-xeon-scalable-processors/ AVX-512 May Be a Hidden Gem” in Intel Xeon Scalable Processors] by [https://software.intel.com/en-us/user/335550 James Reinders], [https://www.hpcwire.com/ HPCwire], June 29, 2017
==Compiler Support==
* [https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX_512 Intel Intrinsics Guide - AVX-512]
* [https://gcc.gnu.org/wiki/cauldron2014?action=AttachFile&do=get&target=Cauldron14_AVX-512_Vector_ISA_Kirill_Yukhin_20140711.pdf Intel® Advanced Vector Extensions 2015/2016 Support in GNU Compiler Collection] (pdf) by [https://www.linkedin.com/in/kirill-yukhin-1158374/ Kirill Yukhin], July 2014
* [https://colfaxresearch.com/knl-avx512/ Guide to Automatic Vectorization with Intel AVX-512 Instructions in Knights Landing Processors - Colfax Research], May 11, 2016
* [https://blogs.msdn.microsoft.com/vcblog/2017/07/11/microsoft-visual-studio-2017-supports-intel-avx-512/ Microsoft Visual Studio 2017 Supports Intel® AVX-512 | Visual C++ Team Blog] by Eric Battalio, July 11, 2017

=References=
<references />

'''[[x86-64|Up one Level]]'''

Navigation menu