Changes

BMI2

6,097 bytes added, 15:24, 23 June 2021

no edit summary

'''BMI2''',

an x86-64 expansion of [[Bit-Twiddling#BitManipulation|bit-manipulation]] instructions by [[Intel]]. Like [[BMI1]], BMI2 employs [https://en.wikipedia.org/wiki/VEX_prefix VEX prefix] encoding to support three-operand syntax with non-destructive source operands on 32- or 64-bit general-purpose registers. Along with [[AVX2]], BMI2 was expected to be part of [[Intel|Intel's]] [https://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29 Haswell] architecture planned for 2013, but was not yet available in one of the first tested Haswell generations of mid 2013 as reported by Andreas Stiller from the German c't magazine <ref>[http://de.linkedin.com/pub/andreas-stiller/a/381/aa9 Andreas Stiller] ('''2013'''). ''[http://www.heise.de/ct/inhalt/2013/14/114/ Der Rechenkünstler]''. [http://www.heise.de/ct/ c't Magazin für Computertechnik] 14/2013, p. 114-119 (German)</ref>. BMI2 requires bit 8 set in EBX of [https://en.wikipedia.org/wiki/CPUID CPUID] with EAX=07H, ECX=0H <ref>[http://software.intel.com/en-us/articles/how-to-detect-new-instruction-support-in-the-4th-generation-intel-core-processor-family How to detect New Instruction support in the 4th generation Intel® Core™ processor family | Intel® Developer Zone] by [http://software.intel.com/de-de/user/76418 Max Locktyukhin], August 05, 2013</ref>. In 2017, BMI2 was further incorporated in [[AMD|AMD's]] [https://en.wikipedia.org/wiki/Zen_(microarchitecture) Zen-architecture] but ~~so far~~ until [https://en.wikipedia.org/wiki/Zen_3 Zen 3] in November 2020 <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75691 Zen3 supports fast PEXT aka BMI2] by [[Alayan Feh]], [[CCC]], November 05, 2020</ref> with a slow implementation of critical instructions such as [[#PDEP|PDEP]] and [[#PEXT|PEXT]] <ref>[http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=32016 Ryzen Fritz Chess Benchmarks ?] by ralunger, [[Computer Chess Forums|Rybka Forum]], March 03, 2017</ref> <ref>[https://www.reddit.com/r/Amd/comments/60i6er/ryzen_and_bmi2_strange_behavior_and_high_latencies/ Ryzen and BMI2: Strange behavior and high latencies] by DonnieTinyHands, [https://en.wikipedia.org/wiki/Reddit Reddit], March 20, 2017 » [[AMD]], [[#PEXT|BMI2 PEXT]]</ref><ref>[https://www.techpowerup.com/forums/threads/amd-announces-a-red-october-zen-3-on-october-8-rdna2-on-october-28.271981/page-2#post-4344965 Re: AMD Announces a Red October: Zen 3 on October 8, RDNA2 on October 28] by dragontamer5788, [https://www.techpowerup.com/forums/ TechPowerUp Forum], September 9, 2020</ref> .

=Instructions=

==BZHI==

Zero High Bits Starting with Specified Bit Position ~~<ref>[http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/intref_cls/common/intref_avx2_bzhi_u.htm _bzhi_u32/u64]</ref>~~.

<pre>

dest ::= src & ((1 << index)-1);

==PDEP==

Parallel Bits Deposit ~~<ref>[http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/intref_cls/common/intref_avx2_pdep_u.htm _pdep_u32/u64]</ref>~~. May be used to map [[First Rank Attacks|first rank attacks]] to any rank, file, or diagonal on the chess board.

===Intrinsic Prototype===

==PEXT==

Parallel Bits Extract ~~<ref> [http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/intref_cls/common/intref_avx2_pext_u.htm _pext_u32/u64]</ref>~~. Great to get the [[First Rank Attacks#TheOuterSquares|inner six bit]] occupancy of any rank, file, or diagonal as index of six consecutive lower bit, or the up to 12 bits for a [[#PEXTBitboards|magic attack lookup]].

===Intrinsic Prototype===

An idea for denser sliding attack tables is also using [[#PDEP|PDEP]], dividing the table size of the [[#PEXTBitboards|PEXT Bitboards]] by four but some trailing computation. Looking up [[Word|16-bit words]] instead of bitboards - to get the scattered bits of the up to four ray attacks per square in an extracted form mapped to a 16-bit word, which then becomes the attack bitboard by performing a deposit instruction with an appropriate second mask per square. The technique was introduced by [[Zach Wegner]] in [[CCC]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=40333 Haswell New Instructions] by [[Zach Wegner]], [[CCC]], September 09, 2011</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=40333&start=12 Re: Haswell New Instructions] by [[Gerd Isenberg]], [[CCC]], September 12, 2011</ref> and implemented by [[Ronald de Man]] with a table of slightly more than 210 KiB <ref>[http://www.talkchess.com/forum/viewtopic.php?t=49611&start=1 Re: 152k rook and bishop attacks using PEXT and PDEP] by [[Ronald de Man]], [[CCC]], October 06, 2013</ref> <ref>[https://github.com/syzygy1/tb/blob/master/src/bmi2.h tb/src/bmi2.h at master · syzygy1/tb · GitHub] by [[Ronald de Man]]</ref>. Ronald's code was tested by [[Jean-Francois Romang]] using [[Stockfish]] as testbed, giving promising results <ref>[http://www.talkchess.com/forum/viewtopic.php?t=51879&start=2 Re: Stockfish haswell optimized build] by [[Jean-Francois Romang]], [[CCC]], April 06, 2014</ref> <ref>[https://github.com/jromang/Stockfish/blob/bmi2/src/bitboard.h#L255 Stockfish/src/bitboard.h at bmi2 · jromang/Stockfish · GitHub] by [[Jean-Francois Romang]]</ref>.

==FEN Compression==

Concerning the application of [[Neural Networks|neural network]] training, in particular [[NNUE #Training|NNUE training]] with millions of [[Chess Position|chess position]] inside a huge file of [[Forsyth-Edwards Notation|FEN]] records,

Lucas Braesch proposed the idea to [https://en.wikipedia.org/wiki/Data_compression compress] a chess position by a sequence of [[#PEXT|PEXT]] instructions,

while the decompression is done by an sequence of corresponding [[#PDEP|PDEP]] instructions <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76892&start=3 Re: FEN compression] by lucasart, [[CCC]], March 17, 2021 » [[Forsyth-Edwards Notation]], [[NNUE #Training|NNUE Training]]</ref>.

This is intended as possible alternative of the [https://en.wikipedia.org/wiki/Huffman_coding Huffman coding] as applied in the [[Gary Linscott#PyTorch NNUE|PyTorch NNUE]] code by [[Gary Linscott]] <ref>[https://github.com/glinscott/nnue-pytorch/blob/master/lib/nnue_training_data_formats.h#L6405 nnue-pytorch/nnue_training_data_formats.h at master · glinscott/nnue-pytorch · GitHub]</ref>. The table illustrates the layout:

{| class="wikitable"

! Item

! Bits

! Definition

|-

| occupancy rocc (remaining occupancy)

| style="text-align:right;" | 64

| style="text-align:center;" | <nowiki>rocc = white | black </nowiki>

|-

| pext(white, rocc)

| style="text-align:right;" | cnt(rocc)

|

|-

| black = rocc ^ white

| style="text-align:right;" | 0

|

|-

| pext(pawns, rocc)

| style="text-align:right;" | cnt(rocc)

|

|-

| pext(knights, rocc)

| style="text-align:right;" | cnt(rocc)

| style="text-align:center;" | rocc ^= pawns

|-

| pext(bishops, rocc)

| style="text-align:right;" | cnt(rocc)

| style="text-align:center;" | rocc ^= bishops

|-

| pext(rooks, rocc)

| style="text-align:right;" | cnt(rocc)

| style="text-align:center;" | rocc ^= rooks

|-

| pext(queens, rocc)

| style="text-align:right;" | cnt(rocc)

| style="text-align:center;" | rocc ^= queens

|-

| kings = rocc

| style="text-align:right;" | 0

|

|-

| side to move

| style="text-align:right;" | 1

|

|-

| ep

| style="text-align:right;" | 1

|

|-

| pext(epSquare, candis)

| style="text-align:right;" | cnt(candis)

| style="text-align:right;" | candis = ((white & pawns & rank4) << 8) | ((black & pawns & rank6) >> 8)

|-

| pext(castler, rooks)

| style="text-align:right;" | cnt(rooks)

| style="text-align:center;" | castler = rooks with castling rights (any color)

|-

| rule50

| style="text-align:right;" | 7

|

|}

The uncompressed piece placement bitstream is based on the [[Bitboard Board-Definition#SixTwo|dense bitboard board-definition]]

and consists of the [[Occupancy|occupancy]], one color bitboard (e.g. White pieces),

and subsequently the piece bitboards of both colors, from pawns, knights, bishops, rooks until queens. King bitboards are redundant,

due to the subsequent [[General Setwise Operations#RelativeComplement|relative complement]] of a remaining occupancy with all mentioned piece bitboards.

Instead of storing the sparsely populated piece bitboards, the compression is due to saving their extracted bits from the remaining occupancy -

with bit size of [[Population Count|population count]] of the remaining occupancy, rather than 64.

The pseudo code below omits further aspects of a chess position, such as side to move, castling ability, en passant target square and halfmove clock.

<pre>

BitStream compress(const U64 * pieceBB, ...) {

BitStream stream;

U64 rocc = pieceBB[nWhite] | pieceBB[nBlack];

stream.write_n_bit(rocc, 64);

stream.write_n_bit(_pext_u64(pieceBB[nWhite], rocc), popCount(rocc));

for (pt = nPawn; pt <= nQueen; ++pt) {

stream.write_n_bit(_pext_u64(pieceBB[pt], rocc), popCount(rocc));

rocc ^= pieceBB[pt];

}

assert (rocc == pieceBB[nKing]);

// side to move, ep and castling omitted

return stream;

}

void decompress (const BitStream &stream, U64 * pieceBB, ...) {

U64 rocc = stream.read_n_bit(64);

pieceBB[nWhite] = _pdep_u64(stream.read_n_bit(popCount(rocc)), rocc);

pieceBB[nBlack] = rocc ^ pieceBB[nWhite];

for (pt = nPawn; pt <= nKing; ++pt) {

pieceBB[pt] = _pdep_u64(stream.read_n_bit(popCount(rocc)), rocc);

rocc ^= pieceBB[pt];

}

assert (rocc == 0);

// side to move, ep and castling omitted

}

</pre>

==Syzygy Generator==

[[Ronald de Man|Ronald de Man's]] generator for [[Syzygy Bases]] can take profit of [[#PDEP|PDEP]] and [[#PEXT|PEXT]] instructions, or to use their serial implementations to further reduce the size of his already compact [[Endgame Tablebases|endgame tablebases]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=48220&start=1 Re: PEXT Bitboards] by [[Ronald de Man]], [[CCC]], June 07, 2013</ref>.

=Early PEXT/PDEP Proposal=

In late 2006, [[Michael Sherwin]] already proposed a '''PEXTPDEP''' instruction, [[#PEXT|Parallel Bits Extract]] controlled by a source mask followed by [[#PDEP|Parallel Bits Deposit]] controlled by a destination mask <ref>[http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5962 New instruction that intel/amd should add] by [[Michael Sherwin]], [[Computer Chess Forums|Winboard Forum]], December 05, 2006</ref> <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75333&start=8 Re: New AMD Zen 3 and Ryzen processors] by [[Michael Sherwin|Mike Sherwin]], [[CCC]], October 11, 2020</ref>.

However, it is not known whether his proposal was recognized by Intel engineers and had any influence on the design of the BMI2 PEXT and PDEP instructions.

=See also=

* [[General Setwise Operations]]

* [[TBM]]

~~=Manuals=~~

* [http://software.intel.com/file/36945 Intel AVX and AVX2 Programming Reference] (pdf)

=Forum Posts=

==2006==

* [http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5962 New instruction that intel/amd should add] by [[Michael Sherwin]], [[Computer Chess Forums|Winboard Forum]], December 05, 2006 » [[#PEXTPDEPProposal|Early PEXT/PDEP Proposal]]

==2011 ...==

* [http://www.talkchess.com/forum/viewtopic.php?t=40333 Haswell New Instructions] by [[Zach Wegner]], [[CCC]], September 09, 2011

* [http://www.talkchess.com/forum/viewtopic.php?t=66737&start=4 Re: Komodo 11.3] by [[Mark Lefler]], [[CCC]], March 04, 2018 » [[AMD]], [[#PEXT|BMI2 PEXT]], [[Komodo#11|Komodo 11.3]]

* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67432 Ryzen 2 and BMI2?] by [[Steve Maughan]], [[CCC]], May 13, 2018 » [[AMD]]

: [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67432&start=12 Re: Ryzen 2 and BMI2?] by [[Joost Buijs]], [[CCC]], May 18, 2020 » [[AVX2]]

* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70167 AMD BMI2 question] by Leo, [[CCC]], March 10, 2019 » [[AMD]]

* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=72538 PEXT/PDEP are even slower than you think on Zen] by [[Gian-Carlo Pascutto]], [[CCC]], December 09, 2019 » [[AMD]]

==2020 ...==

* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67432&start=12 Re: Ryzen 2 and BMI2?] by [[Joost Buijs]], [[CCC]], May 18, 2020 » [[AMD]], [[AVX2]]

* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75333 New AMD Zen 3 and Ryzen processors] by mmt, [[CCC]], October 09, 2020

: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75333&start=8 Re: New AMD Zen 3 and Ryzen processors] by [[Michael Sherwin|Mike Sherwin]], [[CCC]], October 11, 2020 » [[#PEXTPDEPProposal|Early PEXT/PDEP Proposal]]

* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75691 Zen3 supports fast PEXT aka BMI2] by [[Alayan Feh]], [[CCC]], November 05, 2020 » [[AMD]]

* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76892 FEN compression] by lucasart, [[CCC]], March 17, 2021 » [[#FEN Compression|FEN Compression]], [[NNUE #Training|NNUE Training]]

=External Links=

* [https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets Bit Manipulation Instruction Sets from Wikipedia]

* [http://semipublic.comp-arch.net/wiki/Compress_Bits_Under_Mask Compress Bits Under Mask - CompArch]

* [http://software.intel.com/en-us/tags/36092 BMI2 | Intel® Developer Zone]

* [http://www.randombit.net/bitbashing/2012/06/22/haswell_bit_permutations.html Bit manipulations using BMI2 — bitbashing], June 22, 2012

* [http://software.intel.com/en-us/articles/how-to-detect-new-instruction-support-in-the-4th-generation-intel-core-processor-family How to detect New Instruction support in the 4th generation Intel® Core™ processor family | Intel® Developer Zone] by [http://software.intel.com/de-de/user/76418 Max Locktyukhin], August 05, 2013

* [http://users.atw.hu/instlatx64/GenuineIntel00306C3_Haswell_InstLatX64.txt Haswell Instructions Latency]

* [http://software.intel.com/sites/landingpage/IntrinsicsGuide/ Intel Intrinsics Guide]

=References=

'''[[x86-64|Up one Level]]'''

GerdIsenberg

Bureaucrats, Administrators

25,161

edits

Changes

BMI2

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools