Changes

Jump to: navigation, search

BMI2

6,097 bytes added, 15:24, 23 June 2021
no edit summary
'''BMI2''',<br/>
an x86-64 expansion of [[Bit-Twiddling#BitManipulation|bit-manipulation]] instructions by [[Intel]]. Like [[BMI1]], BMI2 employs [https://en.wikipedia.org/wiki/VEX_prefix VEX prefix] encoding to support three-operand syntax with non-destructive source operands on 32- or 64-bit general-purpose registers. Along with [[AVX2]], BMI2 was expected to be part of [[Intel|Intel's]] [https://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29 Haswell] architecture planned for 2013, but was not yet available in one of the first tested Haswell generations of mid 2013 as reported by Andreas Stiller from the German c't magazine <ref>[http://de.linkedin.com/pub/andreas-stiller/a/381/aa9 Andreas Stiller] ('''2013'''). ''[http://www.heise.de/ct/inhalt/2013/14/114/ Der Rechenkünstler]''. [http://www.heise.de/ct/ c't Magazin für Computertechnik] 14/2013, p. 114-119 (German)</ref>. BMI2 requires bit 8 set in EBX of [https://en.wikipedia.org/wiki/CPUID CPUID] with EAX=07H, ECX=0H <ref>[http://software.intel.com/en-us/articles/how-to-detect-new-instruction-support-in-the-4th-generation-intel-core-processor-family How to detect New Instruction support in the 4th generation Intel® Core™ processor family | Intel® Developer Zone] by [http://software.intel.com/de-de/user/76418 Max Locktyukhin], August 05, 2013</ref>. In 2017, BMI2 was further incorporated in [[AMD|AMD's]] [https://en.wikipedia.org/wiki/Zen_(microarchitecture) Zen-architecture] but so far until [https://en.wikipedia.org/wiki/Zen_3 Zen 3] in November 2020 <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75691 Zen3 supports fast PEXT aka BMI2] by [[Alayan Feh]], [[CCC]], November 05, 2020</ref> with a slow implementation of critical instructions such as [[#PDEP|PDEP]] and [[#PEXT|PEXT]] <ref>[http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=32016 Ryzen Fritz Chess Benchmarks ?] by ralunger, [[Computer Chess Forums|Rybka Forum]], March 03, 2017</ref> <ref>[https://www.reddit.com/r/Amd/comments/60i6er/ryzen_and_bmi2_strange_behavior_and_high_latencies/ Ryzen and BMI2: Strange behavior and high latencies] by DonnieTinyHands, [https://en.wikipedia.org/wiki/Reddit Reddit], March 20, 2017 » [[AMD]], [[#PEXT|BMI2 PEXT]]</ref><ref>[https://www.techpowerup.com/forums/threads/amd-announces-a-red-october-zen-3-on-october-8-rdna2-on-october-28.271981/page-2#post-4344965 Re: AMD Announces a Red October: Zen 3 on October 8, RDNA2 on October 28] by dragontamer5788, [https://www.techpowerup.com/forums/ TechPowerUp Forum], September 9, 2020</ref> .
=Instructions=
<span id="BZHI"></span>
==BZHI==
Zero High Bits Starting with Specified Bit Position <ref>[http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/intref_cls/common/intref_avx2_bzhi_u.htm _bzhi_u32/u64]</ref>.
<pre>
dest ::= src & ((1 << index)-1);
<span id="PDEP"></span>
==PDEP==
Parallel Bits Deposit <ref>[http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/intref_cls/common/intref_avx2_pdep_u.htm _pdep_u32/u64]</ref>. May be used to map [[First Rank Attacks|first rank attacks]] to any rank, file, or diagonal on the chess board.
===Intrinsic Prototype===
<span id="PEXT"></span>
==PEXT==
Parallel Bits Extract <ref> [http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/intref_cls/common/intref_avx2_pext_u.htm _pext_u32/u64]</ref>. Great to get the [[First Rank Attacks#TheOuterSquares|inner six bit]] occupancy of any rank, file, or diagonal as index of six consecutive lower bit, or the up to 12 bits for a [[#PEXTBitboards|magic attack lookup]].
===Intrinsic Prototype===
An idea for denser sliding attack tables is also using [[#PDEP|PDEP]], dividing the table size of the [[#PEXTBitboards|PEXT Bitboards]] by four but some trailing computation. Looking up [[Word|16-bit words]] instead of bitboards - to get the scattered bits of the up to four ray attacks per square in an extracted form mapped to a 16-bit word, which then becomes the attack bitboard by performing a deposit instruction with an appropriate second mask per square. The technique was introduced by [[Zach Wegner]] in [[CCC]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=40333 Haswell New Instructions] by [[Zach Wegner]], [[CCC]], September 09, 2011</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=40333&start=12 Re: Haswell New Instructions] by [[Gerd Isenberg]], [[CCC]], September 12, 2011</ref> and implemented by [[Ronald de Man]] with a table of slightly more than 210 KiB <ref>[http://www.talkchess.com/forum/viewtopic.php?t=49611&start=1 Re: 152k rook and bishop attacks using PEXT and PDEP] by [[Ronald de Man]], [[CCC]], October 06, 2013</ref> <ref>[https://github.com/syzygy1/tb/blob/master/src/bmi2.h tb/src/bmi2.h at master · syzygy1/tb · GitHub] by [[Ronald de Man]]</ref>. Ronald's code was tested by [[Jean-Francois Romang]] using [[Stockfish]] as testbed, giving promising results <ref>[http://www.talkchess.com/forum/viewtopic.php?t=51879&start=2 Re: Stockfish haswell optimized build] by [[Jean-Francois Romang]], [[CCC]], April 06, 2014</ref> <ref>[https://github.com/jromang/Stockfish/blob/bmi2/src/bitboard.h#L255 Stockfish/src/bitboard.h at bmi2 · jromang/Stockfish · GitHub] by [[Jean-Francois Romang]]</ref>.
==FEN Compression==
Concerning the application of [[Neural Networks|neural network]] training, in particular [[NNUE #Training|NNUE training]] with millions of [[Chess Position|chess position]] inside a huge file of [[Forsyth-Edwards Notation|FEN]] records,
Lucas Braesch proposed the idea to [https://en.wikipedia.org/wiki/Data_compression compress] a chess position by a sequence of [[#PEXT|PEXT]] instructions,
while the decompression is done by an sequence of corresponding [[#PDEP|PDEP]] instructions <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76892&start=3 Re: FEN compression] by lucasart, [[CCC]], March 17, 2021 » [[Forsyth-Edwards Notation]], [[NNUE #Training|NNUE Training]]</ref>.
This is intended as possible alternative of the [https://en.wikipedia.org/wiki/Huffman_coding Huffman coding] as applied in the [[Gary Linscott#PyTorch NNUE|PyTorch NNUE]] code by [[Gary Linscott]] <ref>[https://github.com/glinscott/nnue-pytorch/blob/master/lib/nnue_training_data_formats.h#L6405 nnue-pytorch/nnue_training_data_formats.h at master · glinscott/nnue-pytorch · GitHub]</ref>. The table illustrates the layout:
{| class="wikitable"
! Item
! Bits
! Definition
|-
| occupancy rocc<br/>(remaining occupancy)
| style="text-align:right;" | 64
| style="text-align:center;" | <nowiki>rocc = white | black </nowiki>
|-
| pext(white, rocc)
| style="text-align:right;" | cnt(rocc)
|
|-
| black = rocc ^ white
| style="text-align:right;" | 0
|
|-
| pext(pawns, rocc)
| style="text-align:right;" | cnt(rocc)
|
|-
| pext(knights, rocc)
| style="text-align:right;" | cnt(rocc)
| style="text-align:center;" | rocc ^= pawns
|-
| pext(bishops, rocc)
| style="text-align:right;" | cnt(rocc)
| style="text-align:center;" | rocc ^= bishops
|-
| pext(rooks, rocc)
| style="text-align:right;" | cnt(rocc)
| style="text-align:center;" | rocc ^= rooks
|-
| pext(queens, rocc)
| style="text-align:right;" | cnt(rocc)
| style="text-align:center;" | rocc ^= queens
|-
| kings = rocc
| style="text-align:right;" | 0
|
|-
| side to move
| style="text-align:right;" | 1
|
|-
| ep
| style="text-align:right;" | 1
|
|-
| pext(epSquare, candis)
| style="text-align:right;" | cnt(candis)
| style="text-align:right;" | candis = ((white & pawns & rank4) << 8)<br/> | ((black & pawns & rank6) >> 8)
|-
| pext(castler, rooks)
| style="text-align:right;" | cnt(rooks)
| style="text-align:center;" | castler = rooks with castling rights<br/>(any color)
|-
| rule50
| style="text-align:right;" | 7
|
|}
The uncompressed piece placement bitstream is based on the [[Bitboard Board-Definition#SixTwo|dense bitboard board-definition]]
and consists of the [[Occupancy|occupancy]], one color bitboard (e.g. White pieces),
and subsequently the piece bitboards of both colors, from pawns, knights, bishops, rooks until queens. King bitboards are redundant,
due to the subsequent [[General Setwise Operations#RelativeComplement|relative complement]] of a remaining occupancy with all mentioned piece bitboards.
Instead of storing the sparsely populated piece bitboards, the compression is due to saving their extracted bits from the remaining occupancy -
with bit size of [[Population Count|population count]] of the remaining occupancy, rather than 64.
The pseudo code below omits further aspects of a chess position, such as side to move, castling ability, en passant target square and halfmove clock.
 
<pre>
BitStream compress(const U64 * pieceBB, ...) {
BitStream stream;
U64 rocc = pieceBB[nWhite] | pieceBB[nBlack];
stream.write_n_bit(rocc, 64);
stream.write_n_bit(_pext_u64(pieceBB[nWhite], rocc), popCount(rocc));
for (pt = nPawn; pt <= nQueen; ++pt) {
stream.write_n_bit(_pext_u64(pieceBB[pt], rocc), popCount(rocc));
rocc ^= pieceBB[pt];
}
assert (rocc == pieceBB[nKing]);
// side to move, ep and castling omitted
return stream;
}
 
void decompress (const BitStream &stream, U64 * pieceBB, ...) {
U64 rocc = stream.read_n_bit(64);
pieceBB[nWhite] = _pdep_u64(stream.read_n_bit(popCount(rocc)), rocc);
pieceBB[nBlack] = rocc ^ pieceBB[nWhite];
for (pt = nPawn; pt <= nKing; ++pt) {
pieceBB[pt] = _pdep_u64(stream.read_n_bit(popCount(rocc)), rocc);
rocc ^= pieceBB[pt];
}
assert (rocc == 0);
// side to move, ep and castling omitted
}
</pre>
==Syzygy Generator==
[[Ronald de Man|Ronald de Man's]] generator for [[Syzygy Bases]] can take profit of [[#PDEP|PDEP]] and [[#PEXT|PEXT]] instructions, or to use their serial implementations to further reduce the size of his already compact [[Endgame Tablebases|endgame tablebases]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=48220&start=1 Re: PEXT Bitboards] by [[Ronald de Man]], [[CCC]], June 07, 2013</ref>.
<span id="PEXTPDEPProposal"></span>
=Early PEXT/PDEP Proposal=
In late 2006, [[Michael Sherwin]] already proposed a '''PEXTPDEP''' instruction, [[#PEXT|Parallel Bits Extract]] controlled by a source mask followed by [[#PDEP|Parallel Bits Deposit]] controlled by a destination mask <ref>[http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5962 New instruction that intel/amd should add] by [[Michael Sherwin]], [[Computer Chess Forums|Winboard Forum]], December 05, 2006</ref> <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75333&start=8 Re: New AMD Zen 3 and Ryzen processors] by [[Michael Sherwin|Mike Sherwin]], [[CCC]], October 11, 2020</ref>.
However, it is not known whether his proposal was recognized by Intel engineers and had any influence on the design of the BMI2 PEXT and PDEP instructions.
=See also=
* [[General Setwise Operations]]
* [[TBM]]
 
=Manuals=
* [http://software.intel.com/file/36945 Intel AVX and AVX2 Programming Reference] (pdf)
=Forum Posts=
==2006==
* [http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5962 New instruction that intel/amd should add] by [[Michael Sherwin]], [[Computer Chess Forums|Winboard Forum]], December 05, 2006 » [[#PEXTPDEPProposal|Early PEXT/PDEP Proposal]]
==2011 ...==
* [http://www.talkchess.com/forum/viewtopic.php?t=40333 Haswell New Instructions] by [[Zach Wegner]], [[CCC]], September 09, 2011
* [http://www.talkchess.com/forum/viewtopic.php?t=66737&start=4 Re: Komodo 11.3] by [[Mark Lefler]], [[CCC]], March 04, 2018 » [[AMD]], [[#PEXT|BMI2 PEXT]], [[Komodo#11|Komodo 11.3]]
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67432 Ryzen 2 and BMI2?] by [[Steve Maughan]], [[CCC]], May 13, 2018 » [[AMD]]
: [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67432&start=12 Re: Ryzen 2 and BMI2?] by [[Joost Buijs]], [[CCC]], May 18, 2020 » [[AVX2]]
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70167 AMD BMI2 question] by Leo, [[CCC]], March 10, 2019 » [[AMD]]
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=72538 PEXT/PDEP are even slower than you think on Zen] by [[Gian-Carlo Pascutto]], [[CCC]], December 09, 2019 » [[AMD]]
==2020 ...==
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67432&start=12 Re: Ryzen 2 and BMI2?] by [[Joost Buijs]], [[CCC]], May 18, 2020 » [[AMD]], [[AVX2]]
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75333 New AMD Zen 3 and Ryzen processors] by mmt, [[CCC]], October 09, 2020
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75333&start=8 Re: New AMD Zen 3 and Ryzen processors] by [[Michael Sherwin|Mike Sherwin]], [[CCC]], October 11, 2020 » [[#PEXTPDEPProposal|Early PEXT/PDEP Proposal]]
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75691 Zen3 supports fast PEXT aka BMI2] by [[Alayan Feh]], [[CCC]], November 05, 2020 » [[AMD]]
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76892 FEN compression] by lucasart, [[CCC]], March 17, 2021 » [[#FEN Compression|FEN Compression]], [[NNUE #Training|NNUE Training]]
=External Links=
* [https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets Bit Manipulation Instruction Sets from Wikipedia]
* [http://semipublic.comp-arch.net/wiki/Compress_Bits_Under_Mask Compress Bits Under Mask - CompArch]
* [http://software.intel.com/en-us/tags/36092 BMI2 | Intel® Developer Zone]
* [http://www.randombit.net/bitbashing/2012/06/22/haswell_bit_permutations.html Bit manipulations using BMI2 — bitbashing], June 22, 2012
* [http://software.intel.com/en-us/articles/how-to-detect-new-instruction-support-in-the-4th-generation-intel-core-processor-family How to detect New Instruction support in the 4th generation Intel® Core™ processor family | Intel® Developer Zone] by [http://software.intel.com/de-de/user/76418 Max Locktyukhin], August 05, 2013
* [http://users.atw.hu/instlatx64/GenuineIntel00306C3_Haswell_InstLatX64.txt Haswell Instructions Latency]
* [http://software.intel.com/sites/landingpage/IntrinsicsGuide/ Intel Intrinsics Guide]
=References=
<references />
 
'''[[x86-64|Up one Level]]'''

Navigation menu