'''[[Main Page|Home]] * [[Hardware]] * [[x86]] * AVX2'''

'''Advanced Vector Extensions 2''' (AVX2) is an expansion of the [[AVX]] instruction set. AVX2 widens most of the [[SSE2]] 128-bit integer instructions to 256 bits and was introduced, along with [[BMI2]], in [[Intel|Intel's]] [https://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29 Haswell] architecture in 2013, and since 2015 in [[AMD|AMD's]] [https://en.wikipedia.org/wiki/Excavator_%28microarchitecture%29 Excavator] microarchitecture.

=Features=
Besides expanding most integer AVX instructions to 256 bits, AVX2 adds 3-operand general-purpose bit manipulation and multiply, [[AVX2#IndividualShifts|vector shifts]], [[Double Word|Double]]- and [[Quad Word]]-granularity any-to-any permutes, and 3-operand [https://en.wikipedia.org/wiki/FMA_instruction_set fused multiply-accumulate] support. An important catch is that not all instructions are simple generalizations of their 128-bit equivalents: many work "in-lane", applying the same 128-bit operation to each 128-bit half of the register rather than performing a full 256-bit operation. For example:
{| class="wikitable"
|-
! Set
! Instruction
! Result
|-
! AVX
| '''vpunpckldq''' xmm1, ABCD, EFGH
| xmm1 := AEBF
|-
! AVX2
| '''vpunpckldq''' ymm1, ABCDEFGH, IJKLMNOP
| ymm1 := AIBJEMFN
|}

If vpunpckldq had been expanded in the more intuitive fashion, the result of the AVX2 operation would be AIBJCKDL. The reason for this design might be to allow AVX to be implemented more easily with two separate 128-bit arithmetic units.
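The in-lane behavior is easy to model in scalar code. The following sketch (illustrative only, with element 0 written leftmost as in the table above) reproduces the AVX2 result:

```cpp
#include <string>

// Scalar model of in-lane VPUNPCKLDQ on a YMM register: each string
// position stands for one dword element, element 0 leftmost.
std::string unpacklo_dq_inlane(const std::string& a, const std::string& b) {
    std::string r(8, ' ');
    for (int lane = 0; lane < 2; ++lane) {   // two independent 128-bit lanes
        int base = lane * 4;                 // first element of this lane
        r[base + 0] = a[base + 0];           // interleave the two low dwords
        r[base + 1] = b[base + 0];           // of each source lane
        r[base + 2] = a[base + 1];
        r[base + 3] = b[base + 1];
    }
    return r;
}
```

unpacklo_dq_inlane("ABCDEFGH", "IJKLMNOP") yields "AIBJEMFN", while the "intuitive" 256-bit generalization would have given "AIBJCKDL".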

Some AVX2 instructions, such as type conversion instructions, take both xmm and ymm registers as arguments. For example:
{| class="wikitable"
|-
! Set
! Instruction
! Operation
|-
! AVX
| '''vpmovzxbw''' xmm1, xmm2/m64
| xmm1 := Packed_Zero_Extend_Byte_To_Word(xmm2/m64)
|-
! AVX2
| '''vpmovzxbw''' ymm1, xmm2/m128
| ymm1 := Packed_Zero_Extend_Byte_To_Word(xmm2/m128)
|}
<span id="IndividualShifts"></span>
=Individual Vector Shifts=
With AVX2 each data element, such as a [[Bitboards|bitboard]] of a [[Quad-Bitboards|quad-bitboard]], may be shifted left or right individually, as specified by the second source operand, using the following [[Assembly]] [https://en.wikipedia.org/wiki/Assembly_language#Opcode_mnemonics_and_extended_mnemonics mnemonics] and [[C]] intrinsic equivalents:
{| class="wikitable"
|-
! Instruction
! Description
! Intrinsic
|-
| '''VPSRLVQ''' ymm1, ymm2, ymm3/m256
| Variable Bit Shift Right Logical
| __m256i [https://software.intel.com/en-us/node/695103 _mm256_srlv_epi64] (__m256i m, __m256i count)
|-
| '''VPSLLVQ''' ymm1, ymm2, ymm3/m256
| Variable Bit Shift Left Logical
| __m256i [https://software.intel.com/en-us/node/695097 _mm256_sllv_epi64] (__m256i m, __m256i count)
|}
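Semantically, every 64-bit element is shifted by its own count, and counts of 64 or greater flush the element to zero. A scalar sketch of the VPSRLVQ behavior over a quad-bitboard (the QBB4 array type is a stand-in for illustration, not part of any real API):

```cpp
#include <array>
#include <cstdint>

// Scalar model of VPSRLVQ: each quadword is shifted right logically by
// its own count; counts of 64 or more zero the element.
using QBB4 = std::array<uint64_t, 4>;

QBB4 srlv_epi64(const QBB4& v, const QBB4& count) {
    QBB4 r{};
    for (int i = 0; i < 4; ++i)
        r[i] = (count[i] < 64) ? (v[i] >> count[i]) : 0;
    return r;
}
```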

=Applications=
With an appropriate [[Quad-Bitboards|quad-bitboard]] class, one may generate attacks of up to four different [[Direction|directions]] using [[AVX2#IndividualShifts|individual shifts]], for instance [[Knight Pattern#Calculation|knight attacks]] or [[Sliding Piece Attacks#Multiple|sliding piece attacks]] with [[Dumb7Fill]] to generate all [[On an empty Board#PositiveRays|positive]] or [[On an empty Board#NegativeRays|negative sliding ray attacks]] passing two times orthogonal and diagonal sliding pieces.

[[include page="MappingHint"]]
<span id="KnightAttacks"></span>
==Knight Attacks==
<pre>
        noNoWe    noNoEa
            +15  +17
             |     |
noWeWe  +6 __|     |__+10  noEaEa
              \   /
               >0<
           __ /   \ __
soWeWe -10   |     |   -6  soEaEa
             |     |
            -17  -15
        soSoWe    soSoEa
</pre>
<pre>
QBB noEaEa_noNoEa_noNoWe_noWeWe(U64 knights) {
   const QBB qmask (notAB, notA, notH, notGH);
   const QBB qshift (10, 17, 15, 6);
   QBB qknights (knights);
   return (qknights << qshift) & qmask;
}

QBB soWeWe_soSoWe_soSoEa_soEaEa(U64 knights) {
   const QBB qmask (notGH, notH, notA, notAB);
   const QBB qshift (10, 17, 15, 6);
   QBB qknights (knights);
   return (qknights >> qshift) & qmask;
}
</pre>
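Unrolled to plain 64-bit operations, the first routine above amounts to the following sketch, assuming the usual little-endian rank-file mapping with A1 = 0 (the QBB class itself is not reproduced here):

```cpp
#include <array>
#include <cstdint>

// File masks for the little-endian rank-file mapping, A1 = bit 0.
constexpr uint64_t notA  = 0xfefefefefefefefeULL; // all but the A file
constexpr uint64_t notAB = 0xfcfcfcfcfcfcfcfcULL; // all but files A and B
constexpr uint64_t notH  = 0x7f7f7f7f7f7f7f7fULL; // all but the H file
constexpr uint64_t notGH = 0x3f3f3f3f3f3f3f3fULL; // all but files G and H

// Scalar unrolling of the quad-bitboard knight step: four directions,
// each with its own shift amount and wrap mask.
std::array<uint64_t, 4> noEaEa_noNoEa_noNoWe_noWeWe(uint64_t knights) {
    const std::array<uint64_t, 4> mask  {notAB, notA, notH, notGH};
    const std::array<uint64_t, 4> shift {10, 17, 15, 6};
    std::array<uint64_t, 4> r{};
    for (int i = 0; i < 4; ++i)
        r[i] = (knights << shift[i]) & mask[i];
    return r;
}
```

A knight on e4 (square 28) yields g5, f6, d6 and c5 in the four slots; a knight on the H file has its noEaEa target masked off instead of wrapping onto the B file.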
<span id="Dumb7Fill"></span>
==Dumb7Fill==
<pre>
  northwest    north   northeast
          noWe nort noEa
            +7  +8  +9
                \ | /
    west   -1 <- 0 -> +1   east
                / | \
            -9  -8  -7
          soWe sout soEa
  southwest    south   southeast
</pre>
<pre>
QBB east_nort_noWe_noEa_Attacks(QBB qsliders /* {rq,rq,bq,bq} */, U64 empty) {
   const QBB qmask (notA, -1, notH, notA);
   const QBB qshift (1, 8, 7, 9);
   QBB qflood (qsliders);
   QBB qempty = QBB(empty) & qmask;
   qflood |= qsliders = (qsliders << qshift) & qempty;
   qflood |= qsliders = (qsliders << qshift) & qempty;
   qflood |= qsliders = (qsliders << qshift) & qempty;
   qflood |= qsliders = (qsliders << qshift) & qempty;
   qflood |= qsliders = (qsliders << qshift) & qempty;
   qflood |=            (qsliders << qshift) & qempty;
   return (qflood << qshift) & qmask;
}

QBB west_sout_soEa_soWe_Attacks(QBB qsliders /* {rq,rq,bq,bq} */, U64 empty) {
   const QBB qmask (notH, -1, notA, notH);
   const QBB qshift (1, 8, 7, 9);
   QBB qflood (qsliders);
   QBB qempty = QBB(empty) & qmask;
   qflood |= qsliders = (qsliders >> qshift) & qempty;
   qflood |= qsliders = (qsliders >> qshift) & qempty;
   qflood |= qsliders = (qsliders >> qshift) & qempty;
   qflood |= qsliders = (qsliders >> qshift) & qempty;
   qflood |= qsliders = (qsliders >> qshift) & qempty;
   qflood |=            (qsliders >> qshift) & qempty;
   return (qflood >> qshift) & qmask;
}
</pre>
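One lane of the fill is easy to follow in scalar code. The sketch below isolates the east direction: six fill steps through empty squares, then one final shift that moves the flood onto blockers and board edges, matching the structure of the quad-bitboard routine above.

```cpp
#include <cstdint>

// Scalar Dumb7Fill for the east direction only, little-endian rank-file
// mapping (A1 = bit 0). notA prevents H-to-A wraps on the left shift.
uint64_t eastAttacks(uint64_t sliders, uint64_t empty) {
    const uint64_t notA = 0xfefefefefefefefeULL;
    uint64_t flood = sliders;
    empty &= notA;                    // pre-mask the propagator
    for (int i = 0; i < 5; ++i)       // five fill steps with feedback
        flood |= sliders = (sliders << 1) & empty;
    flood |= (sliders << 1) & empty;  // sixth fill step
    return (flood << 1) & notA;       // final shift onto blockers
}
```

A rook on a1 with a blocker on e1 attacks b1, c1, d1 and e1 — the blocker square is included, the squares behind it are not.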
<span id="BitboardPermutation"></span>
==Bitboard Permutation==
For each [[Bitboards|bitboard]] in a destination [[Quad-Bitboards|quad-bitboard]], the Qwords Element Permutation ('''VPERMQ''') instruction <ref>_m256i [https://software.intel.com/en-us/node/683670 _mm256_permute4x64_epi64](_m256i val, const int control)</ref> selects one bitboard of a source quad-bitboard. This permits a bitboard in the source operand to be copied to multiple locations in the destination.
<pre>
destQBB.bb[0] = sourceQBB.bb[ (imm8 >> 0) & 3 ]
destQBB.bb[1] = sourceQBB.bb[ (imm8 >> 2) & 3 ]
destQBB.bb[2] = sourceQBB.bb[ (imm8 >> 4) & 3 ]
destQBB.bb[3] = sourceQBB.bb[ (imm8 >> 6) & 3 ]
</pre>
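A scalar model of that selection (the QBB4 array type stands in for a quad-bitboard):

```cpp
#include <array>
#include <cstdint>

// Scalar model of VPERMQ: each destination quadword selects any source
// quadword via a 2-bit field of the immediate control byte.
using QBB4 = std::array<uint64_t, 4>;

QBB4 permute4x64(const QBB4& src, int imm8) {
    QBB4 dst{};
    for (int i = 0; i < 4; ++i)
        dst[i] = src[(imm8 >> (2 * i)) & 3];
    return dst;
}
```

For instance, the control byte 0x1B (binary 00 01 10 11) reverses the quad-bitboard, while 0x00 broadcasts bitboard 0 to all four slots.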
<span id="VerticalNibble"></span>
==Vertical Nibble==
The following code extracts the [[Pieces#PieceCoding|piece-code]] as "[[Quad-Bitboards#getPiece|vertical nibble]]" from a [[Quad-Bitboards|quad-bitboard]] used as [[Board Representation|board representation]] inside a register, "indexed" by square. The idea is to shift the square bit of each bitboard to the leftmost bit, the [https://en.wikipedia.org/wiki/Sign_bit sign bit], and then use the ''VPMOVMSKB'' instruction to get the sign bits of all 32 bytes into a general purpose register. Unfortunately, there is no ''VPMOVMSKQ'' to get only the signs of the four bitboards, so some more masking and mapping is required to get the four-bit piece code ...
<pre>
int getPiece (__m256i qbb, U64 sq) {
   __m128i shift = _mm_cvtsi64x_si128( sq ^ 63 );       /* left shift amount 63-sq */
   qbb = _mm256_sll_epi64( qbb, shift );                /* squares to signs */
   uint32_t qbbsigns = _mm256_movemask_epi8( qbb );     /* get sign bits of 32 bytes */
   return ((qbbsigns & 0x80808080) * 0x00204081) >> 28; /* mask, nibble-map, shift */
}
</pre>
... using these intrinsics ...
* [http://msdn.microsoft.com/en-us/library/6xsd2b20%28v=vs.100%29.aspx _mm_cvtsi64x_si128]
* [https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_sll_epi64&techs=AVX2 _mm256_sll_epi64]
* [https://software.intel.com/en-us/node/695113 _mm256_movemask_epi8]

... and is intended to compile to the following seven [[Assembly|assembly]] instructions, assuming the quad-bitboard is passed in ymm0 and the square in rcx:
<pre>
xor       rcx, 63          ; left shift amount 63-sq
movq      xmm6, rcx        ; shift amount via xmm
vpsllq    ymm6, ymm0, xmm6 ; squares to signs
vpmovmskb eax, ymm6        ; get sign bits of 32 bytes
and       eax, 0x80808080  ; mask the four bitboard sign bits
imul      eax, 0x00204081  ; map them to the upper nibble
shr       eax, 28          ; nibble as piece code
</pre>
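The multiply step deserves a closer look: the masked sign bits sit at positions 7, 15, 23 and 31, and the factor 0x00204081 = 1 + 2^7 + 2^14 + 2^21 slides each of them into one of the top four bits, so the final shift by 28 leaves exactly the piece-code nibble. A scalar check of just this tail:

```cpp
#include <cstdint>

// The mask/multiply/shift tail of getPiece in isolation: bits 7, 15, 23
// and 31 (as delivered by VPMOVMSKB) become nibble bits 0, 1, 2 and 3.
uint32_t signsToNibble(uint32_t qbbsigns) {
    return ((qbbsigns & 0x80808080u) * 0x00204081u) >> 28;
}
```

All cross terms of the multiplication land either below bit 28 or above bit 31, where they are discarded by the shift and the 32-bit truncation respectively.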

=See also=
* [[AltiVec]]
* [[AVX]]
* [[AVX-512]]
* [[BMI2]]
* [[DirGolem]]
* [[MMX]]
* [[SIMD and SWAR Techniques]]
* [[SSE]]
* [[SSE2]]
* [[SSE3]]
* [[SSSE3]]
* [[SSE4]]

=Publications=
* [[Wojciech Muła]], [http://dblp.uni-trier.de/pers/hd/k/Kurz:Nathan Nathan Kurz], [https://github.com/lemire Daniel Lemire] ('''2016'''). ''Faster Population Counts Using AVX2 Instructions''. [https://arxiv.org/abs/1611.07612 arXiv:1611.07612] <ref>[https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx512-harley-seal.cpp sse-popcount/popcnt-avx512-harley-seal.cpp at master · WojciechMula/sse-popcount · GitHub]</ref> » [[AVX-512]], [[Population Count]]
* [[Wojciech Muła]], [https://github.com/lemire Daniel Lemire] ('''2017'''). ''Faster Base64 Encoding and Decoding Using AVX2 Instructions''. [https://arxiv.org/abs/1704.00605 arXiv:1704.00605] » [https://en.wikipedia.org/wiki/Base64 Base64]

=Manuals=
* [https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf Intel® Architecture Instruction Set Extensions Programming Reference] (pdf)
* [https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf Intel® 64 and IA-32 Architectures Optimization Reference Manual] (pdf)

=External Links=
* [https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions_2 Advanced Vector Extensions 2 from Wikipedia]
* [https://software.intel.com/en-us/node/523876 Overview: Intrinsics for Intel® Advanced Vector Extensions 2 (Intel® AVX2) Instructions | Intel® Software]
* [https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX2 Intel Intrinsics Guide - AVX2]
* [http://software.intel.com/en-us/articles/intel-software-development-emulator/ Intel Software Development Emulator], which can be used to experiment with AVX and AVX2 on a CPU that doesn't support them.
* [http://www.agner.org/optimize/blog/read.php?i=25 Stop the instruction set war] by [http://www.agner.org/ Agner Fog]
* [https://software.intel.com/en-us/blogs/2013/05/17/processing-arrays-of-bits-with-intel-advanced-vector-extensions-2-intel-avx2 Processing Arrays of Bits with Intel® Advanced Vector Extensions 2 (Intel® AVX2) | Intel® Developer Zone] by [https://software.intel.com/en-us/user/123920 Thomas Willhalm], May 17, 2013
* [http://users.atw.hu/instlatx64/GenuineIntel00306C3_Haswell_InstLatX64.txt Haswell Instructions Latency]

=References=
<references />

'''[[x86|Up one Level]]'''
