AVX2

Advanced Vector Extensions 2 (AVX2) is an expansion of the AVX instruction set. It adds 256-bit versions of the 128-bit SSE2 integer instructions. AVX2 was introduced, along with BMI2, in Intel's Haswell architecture in 2013, and has been supported by AMD since the Excavator microarchitecture in 2015.

Features

Besides expanding most integer AVX instructions to 256 bits, AVX2 has 3-operand general-purpose bit manipulation and multiply, vector shifts, doubleword- and quadword-granularity any-to-any permutes, and 3-operand fused multiply-accumulate support. An important catch is that not all of the instructions are simple generalizations of their 128-bit equivalents: many work "in-lane", applying the same 128-bit operation to each 128-bit half of the register rather than performing a true 256-bit generalization of the operation. For example:

Set	Instruction	Result
AVX	vpunpckldq xmm0, ABCD, EFGH	xmm0 := AEBF
AVX2	vpunpckldq ymm0, ABCDEFGH, IJKLMNOP	ymm0 := AIBJEMFN

If vpunpckldq had been expanded in the more intuitive fashion, the result of the AVX2 operation would be AIBJCKDL. The reason for this design might be to allow AVX to be implemented more easily with two separate 128-bit arithmetic units.
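
The in-lane behaviour can be checked with the _mm256_unpacklo_epi32 intrinsic, which corresponds to the 256-bit vpunpckldq. The following is a minimal sketch, not a reference implementation: the helper name unpackLowDwords and the AVX2_FN macro (which enables AVX2 per function on GCC and Clang) are assumptions made for illustration, and the letter A is taken to be element 0.

```cpp
#include <immintrin.h>
#include <array>
#include <cstdint>

// Assumption: on GCC/Clang, enable AVX2 for the marked function only,
// so the sketch also builds without -mavx2. On other compilers it is a no-op.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

// Hypothetical helper: interleave the low dwords of a and b (256-bit VPUNPCKLDQ).
AVX2_FN std::array<uint32_t, 8> unpackLowDwords(const std::array<uint32_t, 8>& a,
                                                const std::array<uint32_t, 8>& b) {
    const __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a.data()));
    const __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b.data()));
    // Operates per 128-bit lane: lane 0 yields a0 b0 a1 b1, lane 1 yields a4 b4 a5 b5.
    const __m256i r = _mm256_unpacklo_epi32(va, vb);
    std::array<uint32_t, 8> out;
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out.data()), r);
    return out;
}
```

With a holding A to H and b holding I to P in elements 0 to 7, the returned elements read A I B J E M F N, as in the table.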

Some AVX2 instructions, such as type conversion instructions, take both xmm and ymm registers as arguments. For example:

Set	Instruction	Operation
AVX	vpmovzxbw xmm1, xmm2/m64	xmm1 := Packed_Zero_Extend_Byte_To_Word(xmm2/m64)
AVX2	vpmovzxbw ymm1, xmm2/m128	ymm1 := Packed_Zero_Extend_Byte_To_Word(xmm2/m128)
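
As an illustration of such a mixed-width instruction, the sketch below widens 16 bytes held in an xmm register to 16 words in a ymm register with _mm256_cvtepu8_epi16, the intrinsic form of the 256-bit vpmovzxbw. The helper name zeroExtendBytesToWords and the AVX2_FN macro are assumptions for this example only.

```cpp
#include <immintrin.h>
#include <array>
#include <cstdint>

// Assumption: on GCC/Clang, enable AVX2 for the marked function only.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

// Hypothetical helper: zero-extend 16 unsigned bytes to 16 unsigned words.
AVX2_FN std::array<uint16_t, 16> zeroExtendBytesToWords(const std::array<uint8_t, 16>& bytes) {
    // 128-bit source operand (xmm2/m128 in the table above).
    const __m128i src = _mm_loadu_si128(reinterpret_cast<const __m128i*>(bytes.data()));
    // 256-bit destination: every byte becomes a 16-bit word with a zero upper half.
    const __m256i wide = _mm256_cvtepu8_epi16(src);
    std::array<uint16_t, 16> out;
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out.data()), wide);
    return out;
}
```

Bytes with the top bit set, such as 240, stay 240 after widening, since the extension is unsigned.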

Individual Vector Shifts

With AVX2, each data element, such as a bitboard of a quad-bitboard, may be shifted left or right individually by the amount given in the corresponding element of the second source operand, using the following assembly mnemonics and C intrinsic equivalents:

Instruction	Description	Intrinsic
VPSRLVQ ymm1, ymm2, ymm3/m256	Variable Bit Shift Right Logical	__m256i _mm256_srlv_epi64 (__m256i m, __m256i count)
VPSLLVQ ymm1, ymm2, ymm3/m256	Variable Bit Shift Left Logical	__m256i _mm256_sllv_epi64 (__m256i m, __m256i count)
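
A small sketch of the two intrinsics applied to four bitboards held in memory follows. The names Quad, shiftLeftEach and shiftRightEach, as well as the AVX2_FN macro, are hypothetical and exist only for this example.

```cpp
#include <immintrin.h>
#include <array>
#include <cstdint>

// Assumption: on GCC/Clang, enable AVX2 for the marked functions only.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

using U64  = uint64_t;
using Quad = std::array<U64, 4>;   // four bitboards, element 0 first

// Shift each bitboard left by its own count (VPSLLVQ).
AVX2_FN Quad shiftLeftEach(const Quad& bb, const Quad& count) {
    const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(bb.data()));
    const __m256i c = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(count.data()));
    Quad out;
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out.data()), _mm256_sllv_epi64(v, c));
    return out;
}

// Shift each bitboard right (logical) by its own count (VPSRLVQ).
AVX2_FN Quad shiftRightEach(const Quad& bb, const Quad& count) {
    const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(bb.data()));
    const __m256i c = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(count.data()));
    Quad out;
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out.data()), _mm256_srlv_epi64(v, c));
    return out;
}
```

Unlike the C shift operators, where a count of 64 or more is undefined for a 64-bit operand, these instructions produce zero for any element whose count exceeds 63.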

Applications

With an appropriate quad-bitboard class, one may generate attacks in up to four different directions at once using individual shifts, for instance knight attacks, or sliding piece attacks with Dumb7Fill. The latter generates all positive or all negative sliding ray attacks in one call, with the orthogonal and the diagonal sliding pieces each passed two times.

Code samples and bitboard diagrams rely on Little endian file and rank mapping.

Knight Attacks

        noNoWe    noNoEa
            +15  +17
             |     |
noWeWe  +6 __|     |__+10  noEaEa
              \   /
               >0<
           __ /   \ __
soWeWe -10   |     |   -6  soEaEa
             |     |
            -17  -15
        soSoWe    soSoEa

QBB noEaEa_noNoEa_noNoWe_noWeWe(U64 knights) {
   const QBB qmask  (notAB, notA,notH,notGH);
   const QBB qshift (10,17,15,6);
   QBB qknights (knights);
   return (qknights << qshift) & qmask;
}

QBB soWeWe_soSoWe_soSoEa_soEaEa(U64 knights) {
   const QBB qmask  (notGH,notH,notA,notAB);
   const QBB qshift (10,17,15,6);
   QBB qknights (knights);
   return (qknights >> qshift) & qmask;
}
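
The two routines above presuppose a quad-bitboard class with element-wise shift and mask operators, which the text does not spell out. One possible minimal sketch is given below. The layout of QBB, the constructor argument order (first argument is element 0), the wrapper knightTargetsNorth and the AVX2_FN macro are all assumptions made for illustration.

```cpp
#include <immintrin.h>
#include <array>
#include <cstdint>

// Assumption: on GCC/Clang, enable AVX2 for the marked functions only.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

using U64 = uint64_t;

// Hypothetical minimal quad-bitboard: four bitboards in one 256-bit register.
struct QBB {
    __m256i v;
    AVX2_FN explicit QBB(__m256i x) : v(x) {}
    // Broadcast one bitboard to all four elements.
    AVX2_FN explicit QBB(U64 bb) : v(_mm256_set1_epi64x(static_cast<long long>(bb))) {}
    // Assumption: b0 becomes element 0 (_mm256_set_epi64x lists the highest element first).
    AVX2_FN QBB(U64 b0, U64 b1, U64 b2, U64 b3)
        : v(_mm256_set_epi64x(static_cast<long long>(b3), static_cast<long long>(b2),
                              static_cast<long long>(b1), static_cast<long long>(b0))) {}
};

AVX2_FN inline QBB operator<<(const QBB& a, const QBB& n) { return QBB(_mm256_sllv_epi64(a.v, n.v)); }
AVX2_FN inline QBB operator>>(const QBB& a, const QBB& n) { return QBB(_mm256_srlv_epi64(a.v, n.v)); }
AVX2_FN inline QBB operator&(const QBB& a, const QBB& b)  { return QBB(_mm256_and_si256(a.v, b.v)); }

// File masks for little-endian file and rank mapping (a1 = bit 0).
constexpr U64 notA  = 0xfefefefefefefefeULL;
constexpr U64 notAB = 0xfcfcfcfcfcfcfcfcULL;
constexpr U64 notH  = 0x7f7f7f7f7f7f7f7fULL;
constexpr U64 notGH = 0x3f3f3f3f3f3f3f3fULL;

// Same body as the first routine above, repeated to keep the sketch self-contained.
AVX2_FN QBB noEaEa_noNoEa_noNoWe_noWeWe(U64 knights) {
    const QBB qmask(notAB, notA, notH, notGH);
    const QBB qshift(10, 17, 15, 6);
    const QBB qknights(knights);
    return (qknights << qshift) & qmask;
}

// Hypothetical wrapper that copies the four target sets out to memory.
AVX2_FN std::array<U64, 4> knightTargetsNorth(U64 knights) {
    std::array<U64, 4> out;
    const QBB q = noEaEa_noNoEa_noNoWe_noWeWe(knights);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out.data()), q.v);
    return out;
}
```

For a knight on d4 (square 27) the four elements hold f5, e6, c6 and b5. For a knight on a1 the two westward elements are empty, because the file masks remove the wrapped squares.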

Dumb7Fill

  northwest    north   northeast
  noWe         nort         noEa
          +7    +8    +9
              \  |  /
  west    -1 <-  0 -> +1    east
              /  |  \
          -9    -8    -7
  soWe         sout         soEa
  southwest    south   southeast

QBB east_nort_noWe_noEa_Attacks(QBB qsliders /* {rq,rq,bq,bq} */, U64 empty) {
   const QBB qmask  (notA,-1,notH,notA);
   const QBB qshift (1,8,7,9);
   QBB qflood (qsliders);
   QBB qempty  = QBB(empty) & qmask;
   qflood |= qsliders = (qsliders << qshift) & qempty;
   qflood |= qsliders = (qsliders << qshift) & qempty;
   qflood |= qsliders = (qsliders << qshift) & qempty;
   qflood |= qsliders = (qsliders << qshift) & qempty;
   qflood |= qsliders = (qsliders << qshift) & qempty;
   qflood |=            (qsliders << qshift) & qempty;
   return               (qflood   << qshift) & qmask;
}

QBB west_sout_soEa_soWe_Attacks(QBB qsliders /* {rq,rq,bq,bq} */, U64 empty) {
   const QBB qmask  (notH,-1, notA,notH);
   const QBB qshift (1,8,7,9);
   QBB qflood (qsliders);
   QBB qempty  = QBB(empty) & qmask;
   qflood |= qsliders = (qsliders >> qshift) & qempty;
   qflood |= qsliders = (qsliders >> qshift) & qempty;
   qflood |= qsliders = (qsliders >> qshift) & qempty;
   qflood |= qsliders = (qsliders >> qshift) & qempty;
   qflood |= qsliders = (qsliders >> qshift) & qempty;
   qflood |=            (qsliders >> qshift) & qempty;
   return               (qflood   >> qshift) & qmask;
}
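
For readers without a quad-bitboard class at hand, the positive-direction fill can also be written directly with intrinsics. The sketch below is one possible rendering under stated assumptions: the function name eastNortNoWeNoEaAttacks, the element order (east, north, northwest, northeast in elements 0 to 3) and the AVX2_FN macro are hypothetical, and the six unrolled steps of the routine above are folded into a loop.

```cpp
#include <immintrin.h>
#include <array>
#include <cstdint>

// Assumption: on GCC/Clang, enable AVX2 for the marked function only.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

using U64 = uint64_t;

// rq: rooks and queens, bq: bishops and queens, empty: set of empty squares.
// Returns attacks for east, north, northwest, northeast in elements 0..3.
AVX2_FN std::array<U64, 4> eastNortNoWeNoEaAttacks(U64 rq, U64 bq, U64 empty) {
    const long long notA = static_cast<long long>(0xfefefefefefefefeULL);
    const long long notH = static_cast<long long>(0x7f7f7f7f7f7f7f7fULL);
    // _mm256_set_epi64x lists element 3 first and element 0 last.
    const __m256i mask  = _mm256_set_epi64x(notA, notH, -1, notA);
    const __m256i shift = _mm256_set_epi64x(9, 7, 8, 1);
    __m256i sliders = _mm256_set_epi64x(static_cast<long long>(bq), static_cast<long long>(bq),
                                        static_cast<long long>(rq), static_cast<long long>(rq));
    const __m256i open = _mm256_and_si256(_mm256_set1_epi64x(static_cast<long long>(empty)), mask);
    __m256i flood = sliders;
    // Six flood steps reach at most seven squares along a ray, including the origin.
    for (int i = 0; i < 6; ++i) {
        sliders = _mm256_and_si256(_mm256_sllv_epi64(sliders, shift), open);
        flood   = _mm256_or_si256(flood, sliders);
    }
    // One final shift turns the flood into attacks, including the first blocker.
    const __m256i attacks = _mm256_and_si256(_mm256_sllv_epi64(flood, shift), mask);
    std::array<U64, 4> out;
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out.data()), attacks);
    return out;
}
```

With a single queen on a1 and an otherwise empty board, the east element covers b1 to h1, the north element a2 to a8, the northeast element the long diagonal b2 to h8, and the northwest element is empty.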

Bitboard Permutation

For each bitboard in a destination quad-bitboard, the Qwords Element Permutation (VPERMQ) instruction ^[1] selects one bitboard of a source quad-bitboard. This permits a bitboard in the source operand to be copied to multiple locations in the destination.

 destQBB.bb[0] = sourceQBB.bb[ (imm8 >> 0) & 3 ]
 destQBB.bb[1] = sourceQBB.bb[ (imm8 >> 2) & 3 ]
 destQBB.bb[2] = sourceQBB.bb[ (imm8 >> 4) & 3 ]
 destQBB.bb[3] = sourceQBB.bb[ (imm8 >> 6) & 3 ]
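
The selection rule above can be tried out with _mm256_permute4x64_epi64. In this sketch the control byte is a template parameter, because the intrinsic requires a compile-time constant. The names Quad and permuteBitboards and the AVX2_FN macro are assumptions for illustration.

```cpp
#include <immintrin.h>
#include <array>
#include <cstdint>

// Assumption: on GCC/Clang, enable AVX2 for the marked function only.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

using U64  = uint64_t;
using Quad = std::array<U64, 4>;   // four bitboards, element 0 first

// Hypothetical helper: dest[i] = src[(Imm8 >> 2*i) & 3], done by VPERMQ.
template <int Imm8>
AVX2_FN Quad permuteBitboards(const Quad& src) {
    const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src.data()));
    const __m256i p = _mm256_permute4x64_epi64(v, Imm8);
    Quad out;
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out.data()), p);
    return out;
}
```

For example, a control byte of 0x1B (binary 00 01 10 11) reverses the order of the four bitboards, and 0xAA (binary 10 10 10 10) copies source bitboard 2 into every destination element.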

Vertical Nibble

The following code extracts the piece code as a "vertical nibble" from a quad-bitboard used as board representation inside a register, "indexed" by square. The idea is to shift the square bits to the leftmost bit, the sign bit of each bitboard, and then to use the VPMOVMSKB instruction to get the sign bits of all 32 bytes into a general purpose register. Unfortunately, there is no VPMOVMSKQ to get only the signs of the four bitboards, so some more masking and mapping is required to get the four-bit piece code:

int getPiece (__m256i qbb, U64 sq) {
   __m128i  shift = _mm_cvtsi64_si128( (long long)(sq ^ 63) ); /* left shift amount 63-sq */
   qbb  =  _mm256_sll_epi64( qbb, shift ); /* squares to signs */
   uint32_t qbbsigns = _mm256_movemask_epi8( qbb );  /* get sign bits of 32 bytes */
   return ((qbbsigns & 0x80808080) * 0x00204081) >> 28; /* mask, nibble-map, shift */
}

The intrinsics above are intended to compile to seven assembly instructions, assuming the quad-bitboard is passed in ymm0 and the square in rcx:

  xor       rcx, 63          ; left shift amount 63-sq
  vmovq     xmm6, rcx        ; shift amount via xmm
  vpsllq    ymm6, ymm0, xmm6 ; squares to signs
  vpmovmskb eax, ymm6        ; get sign bits of 32 bytes
  and       eax, 0x80808080  ; mask the four bitboard sign bits
  imul      eax, 0x00204081  ; map them to the upper nibble
  shr       eax, 28          ; nibble as piece code
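
The mask-and-multiply step can be checked against a plain scalar loop. The sketch below is a test aid under stated assumptions, not part of the original routine: it takes the four bitboards from memory instead of a register argument, and the names getPieceScalar and agreesOnAllSquares as well as the AVX2_FN macro are hypothetical.

```cpp
#include <immintrin.h>
#include <cstdint>

// Assumption: on GCC/Clang, enable AVX2 for the marked function only.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

using U64 = uint64_t;

// Vertical nibble via AVX2; bb[i] contributes bit i of the piece code. sq must be 0..63.
AVX2_FN int getPiece(const U64 bb[4], unsigned sq) {
    __m256i qbb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(bb));
    const __m128i shift = _mm_cvtsi32_si128(static_cast<int>(sq ^ 63)); // 63 - sq
    qbb = _mm256_sll_epi64(qbb, shift);                                 // square bit to sign bit
    const uint32_t signs = static_cast<uint32_t>(_mm256_movemask_epi8(qbb));
    // Bits 7, 15, 23, 31 are the four bitboard signs; the multiply moves them to bits 28..31.
    return static_cast<int>(((signs & 0x80808080u) * 0x00204081u) >> 28);
}

// Scalar reference: collect the square's bit from each bitboard.
int getPieceScalar(const U64 bb[4], unsigned sq) {
    int code = 0;
    for (int i = 0; i < 4; ++i)
        code |= static_cast<int>((bb[i] >> sq) & 1) << i;
    return code;
}

// True if both versions agree on all 64 squares.
bool agreesOnAllSquares(const U64 bb[4]) {
    for (unsigned sq = 0; sq < 64; ++sq)
        if (getPiece(bb, sq) != getPieceScalar(bb, sq)) return false;
    return true;
}
```

The multiplier 0x00204081 has bits 0, 7, 14 and 21 set, so sign bit 7 lands on bit 28 (7+21), bit 15 on bit 29 (15+14), bit 23 on bit 30 (23+7) and bit 31 stays on bit 31. No two partial products share a bit position, so no carries disturb the top nibble.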

See also

SIMD

Publications

Wojciech Muła, Nathan Kurz, Daniel Lemire (2016). Faster Population Counts Using AVX2 Instructions. arXiv:1611.07612 ^[2] » AVX-512, Population Count
Wojciech Muła, Daniel Lemire (2017). Faster Base64 Encoding and Decoding Using AVX2 Instructions. arXiv:1704.00605 » Base64
Mathias Gottschlag, Frank Bellosa (2018). Mechanism to Mitigate AVX-Induced Frequency Reduction. arXiv:1901.04982
Mathias Gottschlag, Philipp Machauer, Yussuf Khalil, Frank Bellosa (2021). Fair Scheduling for AVX2 and AVX-512 Workloads. USENIX ATC '21

Manuals

Forum Posts

Does Hyperthreading have trouble with AVX? by cmylin, Stack Overflow, May 19, 2015 » Thread
Re: Tapered Eval between 4 phases by Youri Matiounine, CCC, October 16, 2017 » Tapered Eval
Re: Ryzen 2 and BMI2? by Joost Buijs, CCC, May 18, 2020 » AMD, BMI2
AVX2 optimized SF+NNUE and processor temperature by corres, CCC, September 05, 2020 » Stockfish NNUE
Regarding AVX2 by Rebel, CCC, November 03, 2021 » NNUE

External Links

Advanced Vector Extensions 2 from Wikipedia
Overview: Intrinsics for Intel® Advanced Vector Extensions 2 (Intel® AVX2) Instructions | Intel® Software
Intel Intrinsics Guide - AVX2
Intel Software Development Emulator, which can be used to experiment with AVX and AVX2 on a CPU that doesn't support them.
Stop the instruction set war by Agner Fog
Processing Arrays of Bits with Intel® Advanced Vector Extensions 2 (Intel® AVX2) | Intel® Developer Zone by Thomas Willhalm, May 17, 2013
Haswell Instructions Latency

References

[1] __m256i _mm256_permute4x64_epi64(__m256i val, const int control)
[2] sse-popcount/popcnt-avx512-harley-seal.cpp at master · WojciechMula/sse-popcount · GitHub
