Difference between revisions of "BMI2"

From Chessprogramming wiki
Jump to: navigation, search
(Created page with "'''Home * Hardware * x86-64 * BMI2''' '''BMI2''',<br/> an x86-64 expansion of bit-manipulation instructions by Intel...")
 
 
(16 intermediate revisions by the same user not shown)
Line 2: Line 2:
  
 
'''BMI2''',<br/>
 
'''BMI2''',<br/>
an x86-64 expansion of [[Bit-Twiddling#BitManipulation|bit-manipulation]] instructions by [[Intel]]. Like [[BMI1]], BMI2 employs [https://en.wikipedia.org/wiki/VEX_prefix VEX prefix] encoding to support three-operand syntax with non-destructive source operands on 32- or 64-bit general-purpose registers. Along with [[AVX2]], BMI2 was expected to be part of [[Intel|Intel's]] [https://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29 Haswell] architecture planned for 2013, but was not yet available in one of the first tested Haswell generations of mid 2013 as reported by Andreas Stiller from the German  c't magazine <ref>[http://de.linkedin.com/pub/andreas-stiller/a/381/aa9 Andreas Stiller] ('''2013'''). ''[http://www.heise.de/ct/inhalt/2013/14/114/ Der Rechenkünstler]''. [http://www.heise.de/ct/ c't Magazin für Computertechnik] 14/2013, p. 114-119 (German)</ref>. BMI2 requires bit 8 set in EBX of [https://en.wikipedia.org/wiki/CPUID CPUID] with EAX=07H, ECX=0H <ref>[http://software.intel.com/en-us/articles/how-to-detect-new-instruction-support-in-the-4th-generation-intel-core-processor-family How to detect New Instruction support in the 4th generation Intel® Core™ processor family | Intel® Developer Zone] by [http://software.intel.com/de-de/user/76418 Max Locktyukhin], August 05, 2013</ref>. In 2017, BMI2 was further incorporated in [[AMD|AMD's]] [https://en.wikipedia.org/wiki/Zen_(microarchitecture) Zen-architecture] but so far with a slow implementation of critical instructions such as [[#PDEP|PDEP]] and [[#PEXT|PEXT]] <ref>[http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=32016 Ryzen Fritz Chess Benchmarks ?] by ralunger, [[Computer Chess Forums|Rybka Forum]], March 03, 2017</ref> <ref>[https://www.reddit.com/r/Amd/comments/60i6er/ryzen_and_bmi2_strange_behavior_and_high_latencies/ Ryzen and BMI2: Strange behavior and high latencies] by DonnieTinyHands, [https://en.wikipedia.org/wiki/Reddit Reddit], March 20, 2017 » [[AMD]], [[#PEXT|BMI2 PEXT]]</ref> .
+
an x86-64 expansion of [[Bit-Twiddling#BitManipulation|bit-manipulation]] instructions by [[Intel]]. Like [[BMI1]], BMI2 employs [https://en.wikipedia.org/wiki/VEX_prefix VEX prefix] encoding to support three-operand syntax with non-destructive source operands on 32- or 64-bit general-purpose registers. Along with [[AVX2]], BMI2 was expected to be part of [[Intel|Intel's]] [https://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29 Haswell] architecture planned for 2013, but was not yet available in one of the first tested Haswell generations of mid 2013 as reported by Andreas Stiller from the German  c't magazine <ref>[http://de.linkedin.com/pub/andreas-stiller/a/381/aa9 Andreas Stiller] ('''2013'''). ''[http://www.heise.de/ct/inhalt/2013/14/114/ Der Rechenkünstler]''. [http://www.heise.de/ct/ c't Magazin für Computertechnik] 14/2013, p. 114-119 (German)</ref>.  
 +
BMI2 requires bit 8 set in EBX of [https://en.wikipedia.org/wiki/CPUID CPUID] with EAX=07H, ECX=0H <ref>[http://software.intel.com/en-us/articles/how-to-detect-new-instruction-support-in-the-4th-generation-intel-core-processor-family How to detect New Instruction support in the 4th generation Intel® Core™ processor family | Intel® Developer Zone] by [http://software.intel.com/de-de/user/76418 Max Locktyukhin], August 05, 2013</ref>.  
 +
In 2017, BMI2 was further incorporated in [[AMD|AMD's]] [https://en.wikipedia.org/wiki/Zen_(microarchitecture) Zen-architecture] but until [https://en.wikipedia.org/wiki/Zen_3 Zen 3] in November 2020 <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75691 Zen3 supports fast PEXT aka BMI2] by [[Alayan Feh]], [[CCC]], November 05, 2020</ref> with a slow implementation of critical instructions such as [[#PDEP|PDEP]] and [[#PEXT|PEXT]] <ref>[http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=32016 Ryzen Fritz Chess Benchmarks ?] by ralunger, [[Computer Chess Forums|Rybka Forum]], March 03, 2017</ref> <ref>[https://www.reddit.com/r/Amd/comments/60i6er/ryzen_and_bmi2_strange_behavior_and_high_latencies/ Ryzen and BMI2: Strange behavior and high latencies] by DonnieTinyHands, [https://en.wikipedia.org/wiki/Reddit Reddit], March 20, 2017 » [[AMD]], [[#PEXT|BMI2 PEXT]]</ref>
 +
<ref>[https://www.techpowerup.com/forums/threads/amd-announces-a-red-october-zen-3-on-october-8-rdna2-on-october-28.271981/page-2#post-4344965 Re: AMD Announces a Red October: Zen 3 on October 8, RDNA2 on October 28] by dragontamer5788, [https://www.techpowerup.com/forums/ TechPowerUp Forum], September 9, 2020</ref>.
  
 
=Instructions=
 
=Instructions=
Line 8: Line 11:
 
<span id="BZHI"></span>
 
<span id="BZHI"></span>
 
==BZHI==
 
==BZHI==
Zero High Bits Starting with Specified Bit Position <ref>[http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/intref_cls/common/intref_avx2_bzhi_u.htm _bzhi_u32/u64]</ref>.
+
Zero High Bits Starting with Specified Bit Position.
 
<pre>
 
<pre>
 
dest ::= src & ((1 << index)-1);
 
dest ::= src & ((1 << index)-1);
Line 16: Line 19:
 
<span id="PDEP"></span>
 
<span id="PDEP"></span>
 
==PDEP==
 
==PDEP==
Parallel Bits Deposit <ref>[http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/intref_cls/common/intref_avx2_pdep_u.htm _pdep_u32/u64]</ref>. May be used to map [[First Rank Attacks|first rank attacks]] to any rank, file, or diagonal on the chess board.
+
Parallel Bits Deposit. May be used to map [[First Rank Attacks|first rank attacks]] to any rank, file, or diagonal on the chess board.
  
 
===Intrinsic Prototype===
 
===Intrinsic Prototype===
Line 51: Line 54:
 
<span id="PEXT"></span>
 
<span id="PEXT"></span>
 
==PEXT==
 
==PEXT==
Parallel Bits Extract <ref> [http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/intref_cls/common/intref_avx2_pext_u.htm _pext_u32/u64]</ref>. Great to get the [[First Rank Attacks#TheOuterSquares|inner six bit]] occupancy of any rank, file, or diagonal as index of six consecutive lower bit, or the up to 12 bits for a [[#PEXTBitboards|magic attack lookup]].
+
Parallel Bits Extract. Great to get the [[First Rank Attacks#TheOuterSquares|inner six bit]] occupancy of any rank, file, or diagonal as index of six consecutive lower bit, or the up to 12 bits for a [[#PEXTBitboards|magic attack lookup]].
  
 
===Intrinsic Prototype===
 
===Intrinsic Prototype===
Line 107: Line 110:
 
An idea for denser sliding attack tables is also using [[#PDEP|PDEP]], dividing the table size of the [[#PEXTBitboards|PEXT Bitboards]] by four but some trailing computation. Looking up [[Word|16-bit words]] instead of bitboards - to get the scattered bits of the up to four ray attacks per square in an extracted form mapped to a 16-bit word, which then becomes the attack bitboard by performing a deposit instruction with an appropriate second mask per square. The technique was introduced by [[Zach Wegner]] in [[CCC]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=40333 Haswell New Instructions] by [[Zach Wegner]], [[CCC]], September 09, 2011</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=40333&start=12 Re: Haswell New Instructions] by [[Gerd Isenberg]], [[CCC]], September 12, 2011</ref> and implemented by [[Ronald de Man]] with a table of slightly more than 210 KiB <ref>[http://www.talkchess.com/forum/viewtopic.php?t=49611&start=1 Re: 152k rook and bishop attacks using PEXT and PDEP] by [[Ronald de Man]], [[CCC]], October 06, 2013</ref> <ref>[https://github.com/syzygy1/tb/blob/master/src/bmi2.h tb/src/bmi2.h at master · syzygy1/tb · GitHub] by [[Ronald de Man]]</ref>. Ronald's code was  tested by [[Jean-Francois Romang]] using [[Stockfish]] as testbed, giving promising results  <ref>[http://www.talkchess.com/forum/viewtopic.php?t=51879&start=2 Re: Stockfish haswell optimized build] by [[Jean-Francois Romang]], [[CCC]], April 06, 2014</ref> <ref>[https://github.com/jromang/Stockfish/blob/bmi2/src/bitboard.h#L255 Stockfish/src/bitboard.h at bmi2 · jromang/Stockfish · GitHub] by  [[Jean-Francois Romang]]</ref>.  
 
An idea for denser sliding attack tables is also using [[#PDEP|PDEP]], dividing the table size of the [[#PEXTBitboards|PEXT Bitboards]] by four but some trailing computation. Looking up [[Word|16-bit words]] instead of bitboards - to get the scattered bits of the up to four ray attacks per square in an extracted form mapped to a 16-bit word, which then becomes the attack bitboard by performing a deposit instruction with an appropriate second mask per square. The technique was introduced by [[Zach Wegner]] in [[CCC]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=40333 Haswell New Instructions] by [[Zach Wegner]], [[CCC]], September 09, 2011</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=40333&start=12 Re: Haswell New Instructions] by [[Gerd Isenberg]], [[CCC]], September 12, 2011</ref> and implemented by [[Ronald de Man]] with a table of slightly more than 210 KiB <ref>[http://www.talkchess.com/forum/viewtopic.php?t=49611&start=1 Re: 152k rook and bishop attacks using PEXT and PDEP] by [[Ronald de Man]], [[CCC]], October 06, 2013</ref> <ref>[https://github.com/syzygy1/tb/blob/master/src/bmi2.h tb/src/bmi2.h at master · syzygy1/tb · GitHub] by [[Ronald de Man]]</ref>. Ronald's code was  tested by [[Jean-Francois Romang]] using [[Stockfish]] as testbed, giving promising results  <ref>[http://www.talkchess.com/forum/viewtopic.php?t=51879&start=2 Re: Stockfish haswell optimized build] by [[Jean-Francois Romang]], [[CCC]], April 06, 2014</ref> <ref>[https://github.com/jromang/Stockfish/blob/bmi2/src/bitboard.h#L255 Stockfish/src/bitboard.h at bmi2 · jromang/Stockfish · GitHub] by  [[Jean-Francois Romang]]</ref>.  
  
 +
==FEN Compression==
 +
Concerning the application of [[Neural Networks|neural network]] training, in particular [[NNUE #Training|NNUE training]] with millions of [[Chess Position|chess position]] inside a huge file of [[Forsyth-Edwards Notation|FEN]] records,
 +
Lucas Braesch proposed the idea to [https://en.wikipedia.org/wiki/Data_compression compress] a chess position by a sequence of [[#PEXT|PEXT]] instructions,
 +
while the decompression is done by an sequence of corresponding [[#PDEP|PDEP]] instructions <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76892&start=3 Re: FEN compression] by lucasart, [[CCC]], March 17, 2021 » [[Forsyth-Edwards Notation]], [[NNUE #Training|NNUE Training]]</ref>.
 +
This is intended as possible alternative of the [https://en.wikipedia.org/wiki/Huffman_coding Huffman coding] as applied in the [[Gary Linscott#PyTorch NNUE|PyTorch NNUE]] code by [[Gary Linscott]] <ref>[https://github.com/glinscott/nnue-pytorch/blob/master/lib/nnue_training_data_formats.h#L6405 nnue-pytorch/nnue_training_data_formats.h at master · glinscott/nnue-pytorch · GitHub]</ref>. The table illustrates the layout:
 +
{| class="wikitable"
 +
! Item
 +
! Bits
 +
! Definition
 +
|-
 +
| occupancy rocc<br/>(remaining occupancy)
 +
| style="text-align:right;" | 64
 +
| style="text-align:center;" | <nowiki>rocc = white | black </nowiki>
 +
|-
 +
| pext(white, rocc)
 +
| style="text-align:right;" | cnt(rocc)
 +
|
 +
|-
 +
| black = rocc ^ white
 +
| style="text-align:right;" | 0
 +
|
 +
|-
 +
| pext(pawns, rocc)
 +
| style="text-align:right;" | cnt(rocc)
 +
|
 +
|-
 +
| pext(knights, rocc)
 +
| style="text-align:right;" | cnt(rocc)
 +
| style="text-align:center;" | rocc ^= pawns
 +
|-
 +
| pext(bishops, rocc)
 +
| style="text-align:right;" | cnt(rocc)
 +
| style="text-align:center;" | rocc ^= bishops
 +
|-
 +
| pext(rooks, rocc)
 +
| style="text-align:right;" | cnt(rocc)
 +
| style="text-align:center;" | rocc ^= rooks       
 +
|-
 +
| pext(queens, rocc)
 +
| style="text-align:right;" | cnt(rocc)
 +
| style="text-align:center;" | rocc ^= queens
 +
|-
 +
| kings = rocc
 +
| style="text-align:right;" | 0
 +
|
 +
|-
 +
| side to move
 +
| style="text-align:right;" | 1
 +
|
 +
|-
 +
| ep
 +
| style="text-align:right;" | 1
 +
|
 +
|-
 +
| pext(epSquare, candis)
 +
| style="text-align:right;" | cnt(candis)
 +
| style="text-align:right;" | candis = ((white & pawns & rank4) << 8)<br/> | ((black & pawns & rank6) >> 8)
 +
|-
 +
| pext(castler, rooks)
 +
| style="text-align:right;" | cnt(rooks)
 +
| style="text-align:center;" | castler = rooks with castling rights<br/>(any color)
 +
|-
 +
| rule50
 +
| style="text-align:right;" | 7
 +
|
 +
|}
 +
The uncompressed piece placement bitstream is based on the [[Bitboard Board-Definition#SixTwo|dense bitboard board-definition]]
 +
and consists of the [[Occupancy|occupancy]], one color bitboard (e.g. White pieces),
 +
and subsequently the piece bitboards of both colors, from pawns, knights, bishops, rooks until queens. King bitboards are redundant,
 +
due to the subsequent [[General Setwise Operations#RelativeComplement|relative complement]] of a remaining occupancy with all mentioned piece bitboards. 
 +
Instead of storing the sparsely populated piece bitboards, the compression is due to saving their extracted bits from the remaining occupancy - 
 +
with bit size of [[Population Count|population count]] of the remaining occupancy, rather than 64.
 +
The pseudo code below omits further aspects of a chess position, such as side to move, castling ability, en passant target square and halfmove clock.
 +
 +
<pre>
 +
BitStream compress(const U64 * pieceBB, ...) {
 +
  BitStream stream;
 +
  U64 rocc = pieceBB[nWhite] | pieceBB[nBlack];
 +
  stream.write_n_bit(rocc, 64);
 +
  stream.write_n_bit(_pext_u64(pieceBB[nWhite], rocc), popCount(rocc));
 +
  for (pt = nPawn; pt <= nQueen; ++pt) {
 +
    stream.write_n_bit(_pext_u64(pieceBB[pt], rocc), popCount(rocc));
 +
    rocc ^= pieceBB[pt];
 +
  }
 +
  assert (rocc == pieceBB[nKing]);
 +
  // side to move, ep and castling omitted
 +
  return stream;
 +
}
 +
 +
void decompress (const BitStream &stream, U64 * pieceBB, ...) {
 +
  U64 rocc = stream.read_n_bit(64);
 +
  pieceBB[nWhite] = _pdep_u64(stream.read_n_bit(popCount(rocc)), rocc);
 +
  pieceBB[nBlack] = rocc ^ pieceBB[nWhite];
 +
  for (pt = nPawn; pt <= nKing; ++pt) {
 +
    pieceBB[pt] = _pdep_u64(stream.read_n_bit(popCount(rocc)), rocc);
 +
    rocc ^= pieceBB[pt];
 +
  }
 +
  assert (rocc == 0);
 +
  // side to move, ep and castling omitted
 +
}
 +
</pre>
 
==Syzygy Generator==
 
==Syzygy Generator==
 
[[Ronald de Man|Ronald de Man's]] generator for [[Syzygy Bases]] can take profit of [[#PDEP|PDEP]] and [[#PEXT|PEXT]] instructions, or to use their serial implementations to further reduce the size of his already compact [[Endgame Tablebases|endgame tablebases]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=48220&start=1 Re: PEXT Bitboards] by [[Ronald de Man]], [[CCC]], June 07, 2013</ref>.
 
[[Ronald de Man|Ronald de Man's]] generator for [[Syzygy Bases]] can take profit of [[#PDEP|PDEP]] and [[#PEXT|PEXT]] instructions, or to use their serial implementations to further reduce the size of his already compact [[Endgame Tablebases|endgame tablebases]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=48220&start=1 Re: PEXT Bitboards] by [[Ronald de Man]], [[CCC]], June 07, 2013</ref>.
 +
<span id="PEXTPDEPProposal"></span>
 +
=Early PEXT/PDEP Proposal=
 +
In late 2006, [[Michael Sherwin]] already proposed a '''PEXTPDEP''' instruction, [[#PEXT|Parallel Bits Extract]] controlled by a source mask followed by [[#PDEP|Parallel Bits Deposit]] controlled by a destination mask <ref>[http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5962 New instruction that intel/amd should add] by [[Michael Sherwin]], [[Computer Chess Forums|Winboard Forum]], December 05, 2006</ref> <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75333&start=8 Re: New AMD Zen 3 and Ryzen processors] by [[Michael Sherwin|Mike Sherwin]], [[CCC]], October 11, 2020</ref>.
 +
However, it is not known whether his proposal was recognized by Intel engineers and had any influence on the design of the BMI2 PEXT and PDEP instructions.
  
 
=See also=
 
=See also=
Line 118: Line 226:
 
* [[General Setwise Operations]]
 
* [[General Setwise Operations]]
 
* [[TBM]]
 
* [[TBM]]
 
=Manuals=
 
* [http://software.intel.com/file/36945 Intel AVX and AVX2 Programming Reference] (pdf)
 
  
 
=Forum Posts=
 
=Forum Posts=
 +
==2006==
 +
* [http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5962 New instruction that intel/amd should add] by [[Michael Sherwin]], [[Computer Chess Forums|Winboard Forum]], December 05, 2006 » [[#PEXTPDEPProposal|Early PEXT/PDEP Proposal]]
 +
==2011 ...==
 
* [http://www.talkchess.com/forum/viewtopic.php?t=40333 Haswell New Instructions] by [[Zach Wegner]], [[CCC]], September 09, 2011
 
* [http://www.talkchess.com/forum/viewtopic.php?t=40333 Haswell New Instructions] by [[Zach Wegner]], [[CCC]], September 09, 2011
 
* [http://www.randombit.net/bitbashing/2012/06/22/haswell_bit_permutations.html Bit manipulations using BMI2] by [http://www.randombit.net/ Jack Lloyd], [http://www.randombit.net/ / :: bitbashing], June 22, 2012
 
* [http://www.randombit.net/bitbashing/2012/06/22/haswell_bit_permutations.html Bit manipulations using BMI2] by [http://www.randombit.net/ Jack Lloyd], [http://www.randombit.net/ / :: bitbashing], June 22, 2012
 
* [http://www.talkchess.com/forum/viewtopic.php?t=48220 PEXT Bitboards] by [[Gerd Isenberg]], [[CCC]], June 07, 2013
 
* [http://www.talkchess.com/forum/viewtopic.php?t=48220 PEXT Bitboards] by [[Gerd Isenberg]], [[CCC]], June 07, 2013
 
* [http://www.talkchess.com/forum/viewtopic.php?t=49611 152k rook and bishop attacks using PEXT and PDEP] by [[Lasse Hansen]], [[CCC]], October 06, 2013
 
* [http://www.talkchess.com/forum/viewtopic.php?t=49611 152k rook and bishop attacks using PEXT and PDEP] by [[Lasse Hansen]], [[CCC]], October 06, 2013
* [http://www.talkchess.com/forum/viewtopic.php?t=51879 Stockfish haswell optimized build]  by [[Jean-Francois Romang]], [[CCC]], April 06, 2014 » [[Stockfish]]  
+
* [http://www.talkchess.com/forum/viewtopic.php?t=51879 Stockfish haswell optimized build]  by [[Jean-Francois Romang]], [[CCC]], April 06, 2014 » [[Stockfish]]
 +
==2015 ...==
 
* [http://www.talkchess.com/forum/viewtopic.php?t=58910 smaller tables for PEXT-style attack getters] by [[Wylie Garvin]], [[CCC]], January 13, 2016
 
* [http://www.talkchess.com/forum/viewtopic.php?t=58910 smaller tables for PEXT-style attack getters] by [[Wylie Garvin]], [[CCC]], January 13, 2016
 
* [http://www.talkchess.com/forum/viewtopic.php?t=63978 BMI2 intrinsics in gcc] by [[Álvaro Begué]], [[CCC]], May 14, 2017
 
* [http://www.talkchess.com/forum/viewtopic.php?t=63978 BMI2 intrinsics in gcc] by [[Álvaro Begué]], [[CCC]], May 14, 2017
Line 134: Line 243:
 
* [http://www.talkchess.com/forum/viewtopic.php?t=65191 BMI2 PEXT idea for attacks generation] by [[Mark Lefler]], [[CCC]], September 16, 2017
 
* [http://www.talkchess.com/forum/viewtopic.php?t=65191 BMI2 PEXT idea for attacks generation] by [[Mark Lefler]], [[CCC]], September 16, 2017
 
* [http://www.talkchess.com/forum/viewtopic.php?t=66737&start=4 Re: Komodo 11.3] by [[Mark Lefler]], [[CCC]], March 04, 2018 » [[AMD]], [[#PEXT|BMI2 PEXT]], [[Komodo#11|Komodo 11.3]]
 
* [http://www.talkchess.com/forum/viewtopic.php?t=66737&start=4 Re: Komodo 11.3] by [[Mark Lefler]], [[CCC]], March 04, 2018 » [[AMD]], [[#PEXT|BMI2 PEXT]], [[Komodo#11|Komodo 11.3]]
 +
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67432 Ryzen 2 and BMI2?] by [[Steve Maughan]], [[CCC]], May 13, 2018 » [[AMD]]
 +
: [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67432&start=12 Re: Ryzen 2 and BMI2?] by [[Joost Buijs]], [[CCC]], May 18, 2020 » [[AVX2]]
 +
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70167 AMD BMI2 question] by Leo, [[CCC]], March 10, 2019 » [[AMD]]
 +
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=72538 PEXT/PDEP are even slower than you think on Zen] by [[Gian-Carlo Pascutto]], [[CCC]], December 09, 2019 » [[AMD]]
 +
==2020 ...==
 +
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67432&start=12 Re: Ryzen 2 and BMI2?] by [[Joost Buijs]], [[CCC]], May 18, 2020 » [[AMD]], [[AVX2]]
 +
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75333 New AMD Zen 3 and Ryzen processors] by mmt, [[CCC]], October 09, 2020
 +
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75333&start=8 Re: New AMD Zen 3 and Ryzen processors] by [[Michael Sherwin|Mike Sherwin]], [[CCC]], October 11, 2020 » [[#PEXTPDEPProposal|Early PEXT/PDEP Proposal]]
 +
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75691 Zen3 supports fast PEXT aka BMI2] by [[Alayan Feh]], [[CCC]], November 05, 2020 » [[AMD]]
 +
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76892 FEN compression] by lucasart, [[CCC]], March 17, 2021 » [[#FEN Compression|FEN Compression]], [[NNUE #Training|NNUE Training]]
 +
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=78804 pdep/pext for 128-bit integers and bit-arrays] by [[Rein Halbersma]], [[CCC]], December 03, 2021
  
 
=External Links=
 
=External Links=
 
* [https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets Bit Manipulation Instruction Sets from Wikipedia]
 
* [https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets Bit Manipulation Instruction Sets from Wikipedia]
* [http://semipublic.comp-arch.net/wiki/Compress_Bits_Under_Mask Compress Bits Under Mask - CompArch]
 
* [http://software.intel.com/en-us/tags/36092 BMI2 | Intel® Developer Zone]
 
 
* [http://www.randombit.net/bitbashing/2012/06/22/haswell_bit_permutations.html Bit manipulations using BMI2 — bitbashing], June 22, 2012
 
* [http://www.randombit.net/bitbashing/2012/06/22/haswell_bit_permutations.html Bit manipulations using BMI2 — bitbashing], June 22, 2012
* [http://software.intel.com/en-us/articles/how-to-detect-new-instruction-support-in-the-4th-generation-intel-core-processor-family How to detect New Instruction support in the 4th generation Intel® Core™ processor family | Intel® Developer Zone] by [http://software.intel.com/de-de/user/76418 Max Locktyukhin], August 05, 2013
 
 
* [http://users.atw.hu/instlatx64/GenuineIntel00306C3_Haswell_InstLatX64.txt Haswell Instructions Latency]
 
* [http://users.atw.hu/instlatx64/GenuineIntel00306C3_Haswell_InstLatX64.txt Haswell Instructions Latency]
 
* [http://software.intel.com/sites/landingpage/IntrinsicsGuide/ Intel Intrinsics Guide]
 
* [http://software.intel.com/sites/landingpage/IntrinsicsGuide/ Intel Intrinsics Guide]
Line 146: Line 263:
 
=References=
 
=References=
 
<references />
 
<references />
 
 
'''[[x86-64|Up one Level]]'''
 
'''[[x86-64|Up one Level]]'''

Latest revision as of 13:09, 3 December 2021

Home * Hardware * x86-64 * BMI2

BMI2,
an x86-64 expansion of bit-manipulation instructions by Intel. Like BMI1, BMI2 employs VEX prefix encoding to support three-operand syntax with non-destructive source operands on 32- or 64-bit general-purpose registers. Along with AVX2, BMI2 was expected to be part of Intel's Haswell architecture planned for 2013, but was not yet available in one of the first tested Haswell generations of mid 2013 as reported by Andreas Stiller from the German c't magazine [1]. BMI2 requires bit 8 set in EBX of CPUID with EAX=07H, ECX=0H [2]. In 2017, BMI2 was further incorporated in AMD's Zen-architecture but until Zen 3 in November 2020 [3] with a slow implementation of critical instructions such as PDEP and PEXT [4] [5] [6].

Instructions

Beside the instructions explained in more detail below, there is MULX, Unsigned Multiply, and RORX, SARX/SHLX/SHRX, that is rotate and shifts without affecting processor flags. PEXT and PDEP [7] allow for an alternative of kindergarten- or magic bitboards, and clearly makes maintaining rotated bitboards obsolete [8].

BZHI

Zero High Bits Starting with Specified Bit Position.

dest ::= src & ((1 << index)-1);

unsigned __int64 _bzhi_u64(unsigned __int64 src, unsigned __int32 index);

PDEP

Parallel Bits Deposit. May be used to map first rank attacks to any rank, file, or diagonal on the chess board.

Intrinsic Prototype

unsigned __int64 pdep_u64(unsigned __int64 src, unsigned __int64 mask);

Sample

SRC1   ┌───┬───┬───┬───┬───┐    ┌───┬───┬───┬───┬───┬───┬───┬───┐
       │S63│S62│S61│S60│S59│....│ S7│ S6│ S5│ S4│ S3│ S2│ S1│ S0│ 
       └───┴───┴───┴───┴───┘    └───┴───┴───┴───┴───┴───┴───┴───┘

SRC2   ┌───┬───┬───┬───┬───┐    ┌───┬───┬───┬───┬───┬───┬───┬───┐
(mask) │ 0 │ 0 │ 0 │ 1 │ 0 │0...│ 1 │ 0 │ 1 │ 0 │ 0 │ 1 │ 0 │ 0 │  (f.i. 4 bits set)
       └───┴───┴───┴───┴───┘    └───┴───┴───┴───┴───┴───┴───┴───┘

DEST   ┌───┬───┬───┬───┬───┐    ┌───┬───┬───┬───┬───┬───┬───┬───┐
       │ 0 │ 0 │ 0 │ S3│ 0 │0...│ S2│ 0 │ S1│ 0 │ 0 │ S0│ 0 │ 0 │ 
       └───┴───┴───┴───┴───┘    └───┴───┴───┴───┴───┴───┴───┴───┘

Serial Implementation

in C [9]:

U64 _pdep_u64(U64 val, U64 mask) {
  U64 res = 0;
  for (U64 bb = 1; mask; bb += bb) {
    if (val & bb)
      res |= mask & -mask;
    mask &= mask - 1;
  }
  return res;
}

PEXT

Parallel Bits Extract. Great to get the inner six bit occupancy of any rank, file, or diagonal as index of six consecutive lower bit, or the up to 12 bits for a magic attack lookup.

Intrinsic Prototype

unsigned __int64 _pext_u64(unsigned __int64 src, unsigned __int64 mask);

Sample

SRC1   ┌───┬───┬───┬───┬───┐    ┌───┬───┬───┬───┬───┬───┬───┬───┐
       │S63│S62│S61│S60│S59│....│ S7│ S6│ S5│ S4│ S3│ S2│ S1│ S0│ 
       └───┴───┴───┴───┴───┘    └───┴───┴───┴───┴───┴───┴───┴───┘

SRC2   ┌───┬───┬───┬───┬───┐    ┌───┬───┬───┬───┬───┬───┬───┬───┐
(mask) │ 0 │ 0 │ 0 │ 1 │ 0 │0...│ 1 │ 0 │ 1 │ 0 │ 0 │ 1 │ 0 │ 0 │  (f.i. 4 bits set)
       └───┴───┴───┴───┴───┘    └───┴───┴───┴───┴───┴───┴───┴───┘

DEST   ┌───┬───┬───┬───┬───┐    ┌───┬───┬───┬───┬───┬───┬───┬───┐
       │ 0 │ 0 │ 0 │ 0 │ 0 │0...│ 0 │ 0 │ 0 │ 0 │S60│ S7│ S5│ S2│ 
       └───┴───┴───┴───┴───┘    └───┴───┴───┴───┴───┴───┴───┴───┘

Serial Implementation

in C [10], quite similar to PDEP in traversing the mask, and only two expressions, "mask & -mask" and "bb" swapped:

U64 _pext_u64(U64 val, U64 mask) {
  U64 res = 0;
  for (U64 bb = 1; mask; bb += bb) {
    if ( val & mask & -mask )
      res |= bb;
    mask &= mask - 1;
  }
  return res;
} 

Applications

PEXT Bitboards

Fancy PEXT Bitboards as replacement for Fancy Magic Bitboards. The relevant up to four ray occupancies are mapped to a dense index range by using the PEXT instruction:

U64 arrAttacks  [...]; // ~840 KByte all rook and bishop attacks
U64 arrRookMask  [64]; // 10..12 relevant occupancy bits per rook square
U64 arrBishopMask[64]; //  5.. 9 relevant occupancy bits per bishop square
U32 arrRookBase  [64]; // arrAttacks base offset per rook square
U32 arrBishopBase[64]; // arrAttacks base offset per bishop square

U64 rookAttack(U64 occ, enumSquare sq) {
  return arrAttacks[arrRookBase[sq] + _pext_u64(occ, arrRookMask[sq])];
}

U64 bishopAttack(U64 occ, enumSquare sq) {
  return arrAttacks[arrBishopBase[sq] + _pext_u64(occ, arrBishopMask[sq])];
}

PEXT/PDEP Bitboards

An idea for denser sliding attack tables is also using PDEP, dividing the table size of the PEXT Bitboards by four but some trailing computation. Looking up 16-bit words instead of bitboards - to get the scattered bits of the up to four ray attacks per square in an extracted form mapped to a 16-bit word, which then becomes the attack bitboard by performing a deposit instruction with an appropriate second mask per square. The technique was introduced by Zach Wegner in CCC [11] [12] and implemented by Ronald de Man with a table of slightly more than 210 KiB [13] [14]. Ronald's code was tested by Jean-Francois Romang using Stockfish as testbed, giving promising results [15] [16].

FEN Compression

Concerning the application of neural network training, in particular NNUE training with millions of chess position inside a huge file of FEN records, Lucas Braesch proposed the idea to compress a chess position by a sequence of PEXT instructions, while the decompression is done by an sequence of corresponding PDEP instructions [17]. This is intended as possible alternative of the Huffman coding as applied in the PyTorch NNUE code by Gary Linscott [18]. The table illustrates the layout:

Item Bits Definition
occupancy rocc
(remaining occupancy)
64 rocc = white | black
pext(white, rocc) cnt(rocc)
black = rocc ^ white 0
pext(pawns, rocc) cnt(rocc)
pext(knights, rocc) cnt(rocc) rocc ^= pawns
pext(bishops, rocc) cnt(rocc) rocc ^= bishops
pext(rooks, rocc) cnt(rocc) rocc ^= rooks
pext(queens, rocc) cnt(rocc) rocc ^= queens
kings = rocc 0
side to move 1
ep 1
pext(epSquare, candis) cnt(candis) candis = ((white & pawns & rank4) << 8)
| ((black & pawns & rank6) >> 8)
pext(castler, rooks) cnt(rooks) castler = rooks with castling rights
(any color)
rule50 7

The uncompressed piece placement bitstream is based on the dense bitboard board-definition and consists of the occupancy, one color bitboard (e.g. White pieces), and subsequently the piece bitboards of both colors, from pawns, knights, bishops, rooks until queens. King bitboards are redundant, due to the subsequent relative complement of a remaining occupancy with all mentioned piece bitboards. Instead of storing the sparsely populated piece bitboards, the compression is due to saving their extracted bits from the remaining occupancy - with bit size of population count of the remaining occupancy, rather than 64. The pseudo code below omits further aspects of a chess position, such as side to move, castling ability, en passant target square and halfmove clock.

BitStream compress(const U64 * pieceBB, ...) {
  BitStream stream;
  U64 rocc = pieceBB[nWhite] | pieceBB[nBlack];
  stream.write_n_bit(rocc, 64);
  stream.write_n_bit(_pext_u64(pieceBB[nWhite], rocc), popCount(rocc));
  for (pt = nPawn; pt <= nQueen; ++pt) { 
    stream.write_n_bit(_pext_u64(pieceBB[pt], rocc), popCount(rocc));
    rocc ^= pieceBB[pt]; 
  }
  assert (rocc == pieceBB[nKing]);
  // side to move, ep and castling omitted
  return stream;
}

void decompress (const BitStream &stream, U64 * pieceBB, ...) {
  U64 rocc = stream.read_n_bit(64);
  pieceBB[nWhite] = _pdep_u64(stream.read_n_bit(popCount(rocc)), rocc);
  pieceBB[nBlack] = rocc ^ pieceBB[nWhite];
  for (pt = nPawn; pt <= nKing; ++pt) { 
    pieceBB[pt] = _pdep_u64(stream.read_n_bit(popCount(rocc)), rocc);
    rocc ^= pieceBB[pt]; 
  }
  assert (rocc == 0);
  // side to move, ep and castling omitted
}

Syzygy Generator

Ronald de Man's generator for Syzygy Bases can take profit of PDEP and PEXT instructions, or to use their serial implementations to further reduce the size of his already compact endgame tablebases [19].

Early PEXT/PDEP Proposal

In late 2006, Michael Sherwin already proposed a PEXTPDEP instruction, Parallel Bits Extract controlled by a source mask followed by Parallel Bits Deposit controlled by a destination mask [20] [21]. However, it is not known whether his proposal was recognized by Intel engineers and had any influence on the design of the BMI2 PEXT and PDEP instructions.

See also

Forum Posts

2006

2011 ...

2015 ...

Re: Ryzen 2 and BMI2? by Joost Buijs, CCC, May 18, 2020 » AVX2

2020 ...

Re: New AMD Zen 3 and Ryzen processors by Mike Sherwin, CCC, October 11, 2020 » Early PEXT/PDEP Proposal

External Links

References

  1. Andreas Stiller (2013). Der Rechenkünstler. c't Magazin für Computertechnik 14/2013, p. 114-119 (German)
  2. How to detect New Instruction support in the 4th generation Intel® Core™ processor family | Intel® Developer Zone by Max Locktyukhin, August 05, 2013
  3. Zen3 supports fast PEXT aka BMI2 by Alayan Feh, CCC, November 05, 2020
  4. Ryzen Fritz Chess Benchmarks ? by ralunger, Rybka Forum, March 03, 2017
  5. Ryzen and BMI2: Strange behavior and high latencies by DonnieTinyHands, Reddit, March 20, 2017 » AMD, BMI2 PEXT
  6. Re: AMD Announces a Red October: Zen 3 on October 8, RDNA2 on October 28 by dragontamer5788, TechPowerUp Forum, September 9, 2020
  7. Ruby Lee, Zhijie Shi, Xiao Yang (2001). Efficient Permutation Instructions for Fast Software Cryptography. Micro, IEEE, Vol. 21, No. 6, pdf
  8. Haswell New Instructions by Zach Wegner, CCC, September 09, 2011
  9. Re: PEXT Bitboards by Ronald de Man, CCC, June 07, 2013
  10. Re: PEXT Bitboards by Gerd Isenberg, CCC, June 07, 2013
  11. Haswell New Instructions by Zach Wegner, CCC, September 09, 2011
  12. Re: Haswell New Instructions by Gerd Isenberg, CCC, September 12, 2011
  13. Re: 152k rook and bishop attacks using PEXT and PDEP by Ronald de Man, CCC, October 06, 2013
  14. tb/src/bmi2.h at master · syzygy1/tb · GitHub by Ronald de Man
  15. Re: Stockfish haswell optimized build by Jean-Francois Romang, CCC, April 06, 2014
  16. Stockfish/src/bitboard.h at bmi2 · jromang/Stockfish · GitHub by Jean-Francois Romang
  17. Re: FEN compression by lucasart, CCC, March 17, 2021 » Forsyth-Edwards Notation, NNUE Training
  18. nnue-pytorch/nnue_training_data_formats.h at master · glinscott/nnue-pytorch · GitHub
  19. Re: PEXT Bitboards by Ronald de Man, CCC, June 07, 2013
  20. New instruction that intel/amd should add by Michael Sherwin, Winboard Forum, December 05, 2006
  21. Re: New AMD Zen 3 and Ryzen processors by Mike Sherwin, CCC, October 11, 2020

Up one Level