Difference between revisions of "BitScan"
GerdIsenberg (talk  contribs) 
GerdIsenberg (talk  contribs) 

Line 1,196:  Line 1,196:  
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=74989 Looking for intrinsic "least significant bit" on Visual Studio] by [[Oliver Brausch]], [[CCC]], September 03, 2020 » [[BitScan#TrailingZeroCountTrailing Zero Count]]  * [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=74989 Looking for intrinsic "least significant bit" on Visual Studio] by [[Oliver Brausch]], [[CCC]], September 03, 2020 » [[BitScan#TrailingZeroCountTrailing Zero Count]]  
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=75818 C++20 standard bit operations] by [[Jon Dart]], [[CCC]], November 15, 2020 » [[General Setwise Operations]], [[Population Count]], [[CppC++]]  * [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=75818 C++20 standard bit operations] by [[Jon Dart]], [[CCC]], November 15, 2020 » [[General Setwise Operations]], [[Population Count]], [[CppC++]]  
+  * [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79739 Optimizing Matt Taylor's Folding trick] by [[Daniel Infuehr]], [[CCC]], April 21, 2022  
=External Links=  =External Links= 
Latest revision as of 18:33, 28 April 2022
Home * Board Representation * Bitboards * BitScan
BitScan,
a function that determines the bitindex of the least significant 1 bit (LS1B) or the most significant 1 bit (MS1B) in an integer such as bitboards. If exactly one bit is set in an unsigned integer, representing a numerical value of a power of two, this is equivalent to a base2 logarithm. Many implementations have been devised since the advent of bitboards, as described on this page, and some implementation samples of concrete open source engines listed for didactic purpose.
Contents
Hardware vs. Software
For recent x8664 architectures like Core 2 duo and K10, one should use the Processor Instructions for Bitscans via intrinsics or inline assembly, see x8664 timing. P4 and K8 have rather slow bitscaninstructions. K8 uses so called vector path instructions ^{[2]} with 9 or 11 cycles latency, even blocking other processor resources. For these processors, specially K8 with already fast multiplication, the De Bruijn Multiplication (64bit mode) or Matt Taylor's Folded 32bit Multiplication (32bit mode) might be the right choice. Other routines mentioned might be advantageous on certain architectures, specially with slow integer multiplications.
Non Empty Sets
Bitscan is most often used in serializing bitboards, and is therefor  due to a leading whilecondition  not called with empty sets. Until stated otherwise, most mentioned bitscanroutines in C/C++ have the same prototype and assume none empty sets as actual parameter.
Bitscan forward
A bitscan forward is used to find the index of the least significant 1 bit (LS1B).
Trailing Zero Count
Bitscan forward is identical with a Trailing Zero Count for none empty sets, possibly available as machine instruction on some architectures, for instance the x8664 bitmanipulation expansion set BMI1, with some difficulties to use the _tzcnt_u64 intrisic with Visual Studio ^{[3]}.
De Bruijn Multiplication
The De Bruijn bitscan was devised in 1997, according to Donald Knuth ^{[4]} by Martin Läuter, and independently by Charles Leiserson, Harald Prokop and Keith H. Randall a few month later ^{[5]} ^{[6]} , to determine the LS1B index by minimal perfect hashing. De Bruijn sequences were named after the Dutch mathematician Nicolaas de Bruijn. Interestingly sequences with the binary alphabet were already investigated by the French mathematician Camille Flye SainteMarie in 1894, but later "forgotten" and reinvestigated and generalized by De Bruijn and Tanja van ArdenneEhrenfest half a century later ^{[7]} .
A 64bit De Bruijn Sequence contains 64overlapped unique 6bit sequences, thus a circle of 64 bits, where five leading zeros overlap five hidden "trailing" zeros. There are 226 = 67108864 odd sequences with 6 leading binary zeros and 226 even sequences with 5 leading binary zeros, which may be calculated from the odd ones by shifting left one.
With isolated LS1B
A multiplication with a power of two value (the isolated LS1B) acts like a left shift by it's exponent. Thus, if we multiply a 64bit De Bruijn Sequence with the isolated LS1B, we get a unique six bit subsequence inside the most significant bits. To obtain the bitindex we need to extract these upper six bits by shifting right the product, to lookup an array.
const int index64[64] = { 0, 1, 48, 2, 57, 49, 28, 3, 61, 58, 50, 42, 38, 29, 17, 4, 62, 55, 59, 36, 53, 51, 43, 22, 45, 39, 33, 30, 24, 18, 12, 5, 63, 47, 56, 27, 60, 41, 37, 16, 54, 35, 52, 21, 44, 32, 23, 11, 46, 26, 40, 15, 34, 20, 31, 10, 25, 14, 19, 9, 13, 8, 7, 6 }; /** * bitScanForward * @author Martin Läuter (1997) * Charles E. Leiserson * Harald Prokop * Keith H. Randall * "Using de Bruijn Sequences to Index a 1 in a Computer Word" * @param bb bitboard to scan * @precondition bb != 0 * @return index (0..63) of least significant one bit */ int bitScanForward(U64 bb) { const U64 debruijn64 = C64(0x03f79d71b4cb0a89); assert (bb != 0); return index64[((bb & bb) * debruijn64) >> 58]; }
See also how to Generate your "private" De Bruijn Bitscan Routine.
With separated LS1B
Instead of the classical LS1B isolation, Kim Walisch proposed the faster xor with the ones' decrement. The separation bb ^ (bb1) contains all bits set including and below the LS1B. The 222 (4,194,304) upper De Bruijn Sequences of the 226 available leave unique 6bit indices. Using LS1B separation takes advantage of the x86 lea instruction, which saves the move instruction and unlike negate, has no data dependency on the flag register. Kim reported a 10 to 15 percent faster execution (compilers: g++4.7 O2, clang++3.1 O2, x86_64) than the traditional 64bit De Bruijn bitscan on Intel Nehalem and Sandy Bridge CPUs.
const int index64[64] = { 0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61, 54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62, 46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45, 25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63 }; /** * bitScanForward * @author Kim Walisch (2012) * @param bb bitboard to scan * @precondition bb != 0 * @return index (0..63) of least significant one bit */ int bitScanForward(U64 bb) { const U64 debruijn64 = C64(0x03f79d71b4cb0a89); assert (bb != 0); return index64[((bb ^ (bb1)) * debruijn64) >> 58]; }
Matt Taylor's Folding trick
A 32bit friendly implementation to find the the bitindex of LS1B by Matt Taylor ^{[8]} ^{[9]}. The xor with the ones' decrement, bb ^ (bb1) contains all bits set including and below the LS1B. The 32bit xordifference of both halves yields either the complement of the upper half, or the lower half otherwise. Some samples:
ls1b  bb ^ (bb1)  folded 

63  0xffffffffffffffff

0x00000000

62  0x7fffffffffffffff

0x80000000

59  0x0fffffffffffffff

0xf0000000

32  0x00000001ffffffff

0xfffffffe

31  0x00000000ffffffff

0xffffffff

30  0x000000007fffffff

0x7fffffff

0  0x0000000000000001

0x00000001

Even if this folded "LS1B" contains multiple consecutive onebits, the multiplication is De Bruijn like. There are only two magic 32bit constants with the combined property of 32 and 64bit De Bruijn Sequences to apply this minimal perfect hashing:
const int lsb_64_table[64] = { 63, 30, 3, 32, 59, 14, 11, 33, 60, 24, 50, 9, 55, 19, 21, 34, 61, 29, 2, 53, 51, 23, 41, 18, 56, 28, 1, 43, 46, 27, 0, 35, 62, 31, 58, 4, 5, 49, 54, 6, 15, 52, 12, 40, 7, 42, 45, 16, 25, 57, 48, 13, 10, 39, 8, 44, 20, 47, 38, 22, 17, 37, 36, 26 }; /** * bitScanForward * @author Matt Taylor (2003) * @param bb bitboard to scan * @precondition bb != 0 * @return index (0..63) of least significant one bit */ int bitScanForward(U64 bb) { unsigned int folded; assert (bb != 0); bb ^= bb  1; folded = (int) bb ^ (bb >> 32); return lsb_64_table[folded * 0x78291ACF >> 26]; }
A slightly modified version may take one x86register less in 32bit mode, but calculates bb1 twice:
int bitScanForwardM(BitBoard bb) { unsigned int folded; assert (bb != 0); folded = (int)((bb ^ (bb1)) >> 32); folded ^= (int)( bb ^ (bb1)); // lea return lsb_64_table[folded * 0x78291ACF >> 26]; }
with this VC6 generated x86 assembly to compare:
bitScanForward PROC NEAR bitScanForwardM PROC NEAR mov ecx, DWORD PTR _bb$[esp4] mov eax, DWORD PTR _bb$[esp4] mov eax, DWORD PTR _bb$[esp] mov ecx, eax mov edx, ecx add ecx, 1 push esi mov ecx, DWORD PTR _bb$[esp] add edx, 1 mov edx, ecx mov esi, eax adc edx, 1 adc esi, 1 xor edx, ecx xor ecx, edx lea ecx, DWORD PTR [eax1] xor eax, esi xor edx, ecx pop esi xor eax, ecx xor edx, eax imul eax, 78291acfH imul edx, 78291acfH shr eax, 26 shr edx, 26 mov eax, DWORD PTR _lsb_64_table[eax*4] mov eax, DWORD PTR _lsb_64_table[edx*4] ret 0 ret 0 bitScanForward ENDP bitScanForward ENDP
Walter Faxon's magic Bitscan
Walter Faxon's 32bit friendly magic bitscan ^{[10]} uses a fast none minimal perfect hashing function:
const char LSB_64_table[154] = { #define __ 0 22,__,__,__,30,__,__,38,18,__, 16,15,17,__,46, 9,19, 8, 7,10, 0, 63, 1,56,55,57, 2,11,__,58, __,__,20,__, 3,__,__,59,__,__, __,__,__,12,__,__,__,__,__,__, 4,__,__,60,__,__,__,__,__,__, __,__,__,__,21,__,__,__,29,__, __,37,__,__,__,13,__,__,45,__, __,__, 5,__,__,61,__,__,__,53, __,__,__,__,__,__,__,__,__,__, 28,__,__,36,__,__,__,__,__,__, 44,__,__,__,__,__,27,__,__,35, __,52,__,__,26,__,43,34,25,23, 24,33,31,32,42,39,40,51,41,14, __,49,47,48,__,50, 6,__,__,62, __,__,__,54 #undef __ }; /** * bitScanForward * @author Walter Faxon, slightly modified * @param bb bitboard to scan * @precondition bb != 0 * @return index (0..63) of least significant one bit */ int bitScanForward(U64 bb) { unsigned int t32; assert(bb); bb ^= bb  1; t32 = (int)bb ^ (int)(bb >> 32); t32 ^= 0x01C5FC81; t32 += t32 >> 16; t32 = (t32 >> 8) + 51; return LSB_64_table [t32 & 255]; // 0..63 }
A slightly modified version may take one x86register less in 32bit mode, but calculates bb1 twice:
int bitScanForward(U64 bb) { int t32 = 0x01C5FC81; assert(bb); t32 ^= (int)((bb ^ (bb1)) >> 32); t32 ^= (int)( bb ^ (bb1)); // lea t32 += t32 >> 16; t32 =(t32 >> 8) + 51; return LSB_64_table [t32 & 255]; }
The initial LS1B separation by bb ^ (bb1) and folding is equivalent to Matt's,
ls1b  bb ^ (bb1)  folded 

63  0xffffffffffffffff

0x00000000

62  0x7fffffffffffffff

0x80000000

59  0x0fffffffffffffff

0xf0000000

32  0x00000001ffffffff

0xfffffffe

31  0x00000000ffffffff

0xffffffff

30  0x000000007fffffff

0x7fffffff

0  0x0000000000000001

0x00000001

while Walter originally resets the LS1B, yielding in a cyclic index wrap:
LS1B  (bb & (bb1)) ^ (bb1)  folded 

0  0x0000000000000000

0x00000000

63  0x7fffffffffffffff

0x80000000

60  0x0fffffffffffffff

0xf0000000

33  0x00000001ffffffff

0xfffffffe

32  0x00000000ffffffff

0xffffffff

31  0x000000007fffffff

0x7fffffff

1  0x0000000000000001

0x00000001

Bitscan by Modulo
Another idea is to apply a modulo (remainder of a division) operation of the isolated LS1B by the prime number 67 ^{[11]} ^{[12]} . The remainder 0..66 can be used to perfectly hash the bitindex table. Three gaps are 0, 17, and 34, so the mod 67 can make a branchless trailing zero count:
BitIndex  Bitboard  mod 67 

  0x0000000000000000

0 
0  0x0000000000000001

1 
1  0x0000000000000002

2 
2  0x0000000000000004

4 
3  0x0000000000000008

8 
4  0x0000000000000010

16 
5  0x0000000000000020

32 
6  0x0000000000000040

64 
7  0x0000000000000080

61 
8  0x0000000000000100

55 
9  0x0000000000000200

43 
10  0x0000000000000400

19 
11  0x0000000000000800

38 
12  0x0000000000001000

9 
13  0x0000000000002000

18 
14  0x0000000000004000

36 
15  0x0000000000008000

5 
16  0x0000000000010000

10 
17  0x0000000000020000

20 
18  0x0000000000040000

40 
19  0x0000000000080000

13 
20  0x0000000000100000

26 
21  0x0000000000200000

52 
22  0x0000000000400000

37 
23  0x0000000000800000

7 
24  0x0000000001000000

14 
25  0x0000000002000000

28 
26  0x0000000004000000

56 
27  0x0000000008000000

45 
28  0x0000000010000000

23 
29  0x0000000020000000

46 
30  0x0000000040000000

25 
31  0x0000000080000000

50 
32  0x0000000100000000

33 
33  0x0000000200000000

66 
34  0x0000000400000000

65 
35  0x0000000800000000

63 
36  0x0000001000000000

59 
37  0x0000002000000000

51 
38  0x0000004000000000

35 
39  0x0000008000000000

3 
40  0x0000010000000000

6 
41  0x0000020000000000

12 
42  0x0000040000000000

24 
43  0x0000080000000000

48 
44  0x0000100000000000

29 
45  0x0000200000000000

58 
46  0x0000400000000000

49 
47  0x0000800000000000

31 
48  0x0001000000000000

62 
49  0x0002000000000000

57 
50  0x0004000000000000

47 
51  0x0008000000000000

27 
52  0x0010000000000000

54 
53  0x0020000000000000

41 
54  0x0040000000000000

15 
55  0x0080000000000000

30 
56  0x0100000000000000

60 
57  0x0200000000000000

53 
58  0x0400000000000000

39 
59  0x0800000000000000

11 
60  0x1000000000000000

22 
61  0x2000000000000000

44 
62  0x4000000000000000

21 
63  0x8000000000000000

42 
/** * trailingZeroCount * @param bb bitboard to scan * @return index (0..63) of least significant one bit, 64 if bb is zero */ int trailingZeroCount(U64 bb) { static const int lookup67[67+1] = { 64, 0, 1, 39, 2, 15, 40, 23, 3, 12, 16, 59, 41, 19, 24, 54, 4, 1, 13, 10, 17, 62, 60, 28, 42, 30, 20, 51, 25, 44, 55, 47, 5, 32, 1, 38, 14, 22, 11, 58, 18, 53, 63, 9, 61, 27, 29, 50, 43, 46, 31, 37, 21, 57, 52, 8, 26, 49, 45, 36, 56, 7, 48, 35, 6, 34, 33, 1 }; return lookup67[(bb & bb) % 67]; }
Since div/mod is an expensive instruction, a modulo by a constant is likely replaced by reciprocal fixed point multiplication to get the quotient and a second multiplication and difference to get the remainder. Compared with De Bruijn multiplication it is still too slow.
Divide and Conquer
This is a broad group of bitscans that test in succession, like the trailing zero count based on Reinhard Scharnagl's proposal ^{[13]} :
/** * trailingZeroCount * like bitScanForward for none empty sets * @author Reinhard Scharnagl * @param bb bitboard to scan * @return index (0..64) */ unsigned char lsbRS[256] = { 8, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0 }; int trailingZeroCount(U64 b) { unsigned buf; int acc = 0; if ((buf = (unsigned)b) == 0) { buf = (unsigned)(b >> 32); acc = 32; } if ((unsigned short)buf == 0) { buf >>= 16; acc += 16; } if ((unsigned char)buf == 0) { buf >>= 8; acc += 8; } return acc + lsbRS[buf & 0xff]; }
What about direct calculation? On x86 this is a chain of test, set and lea instructions:
/** * bitScanForward * @author Gerd Isenberg * @param bb bitboard to scan * @precondition bb != 0 * @return index (0..63) of least significant one bit */ int bitScanForward(U64 bb) { unsigned int lsb; assert (bb != 0); bb &= bb; // LS1BIsolation lsb = (unsigned)bb  (unsigned)(bb>>32); return (((((((((((unsigned)(bb>>32) !=0) * 2) + ((lsb & 0xffff0000) !=0)) * 2) + ((lsb & 0xff00ff00) !=0)) * 2) + ((lsb & 0xf0f0f0f0) !=0)) * 2) + ((lsb & 0xcccccccc) !=0)) * 2) + ((lsb & 0xaaaaaaaa) !=0); }
Double conversion of LS1B
Assuming 64bit doubles and littleendian structure (not portable). We convert the isolated LS1B to a double and interprete the exponent:
/** * bitScanForward * @author Gerd Isenberg * @param bb bitboard to scan * @return index (0..63) of least significant one bit * 1023 if passing zero */ int bitScanForward(U64 bb) { union { double d; struct { unsigned int mantissal : 32; unsigned int mantissah : 20; unsigned int exponent : 11; unsigned int sign : 1; }; } ud; ud.d = (double)(bb & bb); // isolated LS1B to double return ud.exponent  1023; }
Index of LS1B by Popcount
If we have a fast populationcount instruction, we can count the trailing zeros of LS1B after subtracting one:
// precondition bb != 0 int bitScanForward(U64 bb) { assert (bb != 0); return popCount( (bb & bb)  1 ); }
Bitscan reverse
A bitscan reverse is used to find the index of the most significant 1 bit (MS1B). For non empty sets it is equivalent to floor of the base2 logarithm. MS1B isolalation or separation is more expensive than LS1B isolalation or separation, due to the LS1B related Two's complement tricks are not applicable. However, beside Divide and Conquer and Double conversion, Bitscan reverse with MS1B separation is mentioned.
Divide and Conquer
As introduced by Eugene Nalimov in 2000, for an IA64 version of Crafty ^{[14]} ^{[15]}
/** * bitScanReverse * @author Eugene Nalimov * @param bb bitboard to scan * @return index (0..63) of most significant one bit */ int bitScanReverse(U64 bb) { int result = 0; if (bb > 0xFFFFFFFF) { bb >>= 32; result = 32; } if (bb > 0xFFFF) { bb >>= 16; result += 16; } if (bb > 0xFF) { bb >>= 8; result += 8; } return result + ms1bTable[bb]; }
Tribute to Frank Zappa
A branchless and little bit obfuscated version of the devide and conquer bitScanReverse with inregisterlookup ^{[16]}  as tribute to Frank Zappa with identifiers from Freak Out! (1966), Hot Rats (1969), Waka/Jawaka (1972), Sofa (1975), One Size Fits All (1975), Sheik Yerbouti (1979), and Jazz from Hell (1986):
typedef unsigned __int64 OneSizeFits; typedef unsigned int HotRats; const HotRats s = 0; const HotRats heik = 457; const HotRats y = 1; const HotRats e = 2; const HotRats r = 3; const HotRats b = 4; const HotRats o = 5; const HotRats u = 8; const HotRats t = 16; const HotRats i = 32; const HotRats ka = (1<< 4)1; const HotRats waka = (1<< 8)1; const HotRats jawaka = (1<<16)1; const HotRats jazzFromHell = 0(16*3*heik); HotRats freakOut(OneSizeFits all) { HotRats so,fa; fa = (HotRats)(all >> i); so = (fa!=s) << o; fa ^= (HotRats) all & (fa!=s)y; so ^= (jawaka < fa) << b; fa >>= (jawaka < fa) << b; so ^= ( waka  fa) >> t & u; fa >>= ( waka  fa) >> t & u; so ^= ( ka  fa) >> u & b; fa >>= ( ka  fa) >> u & b; so ^= jazzFromHell >> e*fa & r; return so; }
De Bruijn Multiplication
While the tribute to Frank Zappa is quite 32bit friendly ^{[17]}, Kim Walisch suggested to use the parallel prefix fill for a MS1B separation with the same De Bruijn multiplication and lookup as in his bitScanForward routine with separated LS1B, with less instructions in 64bit mode. A log base 2 method was already devised by Eric Cole on January 8, 2006, and shaved off rounded up to one less than the next power of 2 by Mark Dickinson ^{[18]} on December 10, 2009, as published in Sean Eron Anderson's Bit Twiddling Hacks for 32bit integers ^{[19]}.
const int index64[64] = { 0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61, 54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62, 46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45, 25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63 }; /** * bitScanReverse * @authors Kim Walisch, Mark Dickinson * @param bb bitboard to scan * @precondition bb != 0 * @return index (0..63) of most significant one bit */ int bitScanReverse(U64 bb) { const U64 debruijn64 = C64(0x03f79d71b4cb0a89); assert (bb != 0); bb = bb >> 1; bb = bb >> 2; bb = bb >> 4; bb = bb >> 8; bb = bb >> 16; bb = bb >> 32; return index64[(bb * debruijn64) >> 58]; }
Double conversion
Assuming 64bit doubles and littleendian structure (not portable!). Conversion to a double, interpreting the exponent. To avoid possible rounding errors, some lower bits may be cleared.
/** * bitScanReverse * @author Gerd Isenberg * @param bb bitboard to scan * @return index (0..63) of most significant one bit * 1023 if passing zero */ int bitScanReverse(U64 bb) { union { double d; struct { unsigned int mantissal : 32; unsigned int mantissah : 20; unsigned int exponent : 11; unsigned int sign : 1; }; } ud; ud.d = (double)(bb & ~(bb >> 32)); // avoid rounding error return ud.exponent  1023; }
Leading Zero Count
Some processors have a fast leading zero count instruction. The Motorola 68020 has a bit field find first one instruction (BFFFO), which actually performs an up to 32bit Leading Zero Count ^{[20]} . x8664 AMD K10 has lzcnt as part of the SSE4a extension ^{[21]} ^{[22]} , BMI1 has lzcnt as well, while AVX512CD even features leading zero count on vectors of eight bitbaords.
One can replace bitScanReverse of non empty sets by leadingZeroCount xor 63. Like trailing zero count, it returns 64 for empty sets, and might therefor save the leading condition in some applications.
Bitscan versus Zero Count
While the presented bitscan routines are suited to work only on none empty sets and return a valuerange from 0 to 63 as bitindex, leading or trailing zerocount instructions or routines leave 64 for empty sets. Zerocounting has a immanent property of dealing correctly with empty sets  while it likely takes a conditional branch to implement this semantic in bitscanning.
int trailingZeroCount(U64 bb) { if ( bb ) return bitScanForward(bb); return 64; } int leadingZeroCount(U64 bb) { if ( bb ) return bitScanReverse(bb) ^ 63; return 64; }
Bitscan with Reset
While traversing sets, one may combine bitscanning with reset found bit. That implies passing the bitboard per reference or pointer, and tends to confuse compilers to keep all inside registers inside a typical serialization loop ^{[23]} ^{[24]}.
int bitScanForwardWithReset(U64 &bb) { // also called dropForward int idx = bitScanForward(bb); bb &= bb  1; // reset bit outside return idx; }
Generalized Bitscan
This generalized bitscan uses a boolean parameter to scan reverse or forward. It relies on bitScanReverse, but conditionally masks the LS1B in case of scanning forward. It might be used in the classical approach to get positive or negative ray directions with one generalized routine.
/** * generalized bitScan * @author Gerd Isenberg * @param bb bitboard to scan * @precondition bb != 0 * @param reverse, true bitScanReverse, false bitScanForward * @return index (0..63) of least/most significant one bit */ int bitScan(U64 bb, bool reverse) { U64 rMask; assert (bb != 0); rMask = (U64)reverse; bb &= bb  rMask; return bitScanReverse(bb); }
Processor Instructions for Bitscans
x86
x8664 processors have bitscan instructions and can be accessed with compilers today through either inline assembly or compiler intrinsics. For the Microsoft/Intel C compiler, the intrinsics can be accessed by including and using the instructions _BitScanForward64 ^{[25]} , _BitScanReverse64 ^{[26]} or _lzcnt64 ^{[27]} .
unsigned char_BitScanForward64(unsigned long * Index, unsigned __int64 Mask); unsigned char _BitScanReverse64(unsigned long * Index, unsigned __int64 Mask); unsigned __int64 __lzcnt64(unsigned __int64 value); // AMD K10 only see CPUID
Linux provides library functions ^{[28]} , find first bit set (ffsll) in a word leaves an index of 1..64, and zero of no bit is set ^{[29]} . GCC 4.4.5 further has the Builtin Function _builtin_ffsll for finding the least significant one bit, _builtin_ctzll for trailing, and _builtin_clzll for leading zero count ^{[30]} :
/* Returns one plus the index of the least significant 1bit of x, or if x is zero, returns zero */ int __builtin_ffsll (unsigned long long); /* Returns the number of trailing 0bits in x, starting at the least significant bit position. If x is 0, the result is undefined */ int __builtin_ctzll (unsigned long long); /* Returns the number of leading 0bits in x, starting at the most significant bit position. If x is 0, the result is undefined */ int __builtin_clzll (unsigned long long);
Emulating Intrinsics
For the GNU C compiler, the intrinsics can be emulated with inline assembly ^{[31]} .
//These processor instructions work only for 64bit processors #ifdef _MSC_VER #include <intrin.h> #ifdef _WIN64 #pragma intrinsic(_BitScanForward64) #pragma intrinsic(_BitScanReverse64) #define USING_INTRINSICS #endif #elif defined(__GNUC__) && defined(__LP64__) static INLINE unsigned char _BitScanForward64(unsigned long* Index, U64 Mask) { U64 Ret; __asm__ ( "bsfq %[Mask], %[Ret]" :[Ret] "=r" (Ret) :[Mask] "mr" (Mask) ); *Index = (unsigned long)Ret; return Mask?1:0; } static INLINE unsigned char _BitScanReverse64(unsigned long* Index, U64 Mask) { U64 Ret; __asm__ ( "bsrq %[Mask], %[Ret]" :[Ret] "=r" (Ret) :[Mask] "mr" (Mask) ); *Index = (unsigned long)Ret; return Mask?1:0; } #define USING_INTRINSICS #endif
Intrinsics versus asm
Alternatively, rather than to emulate the intrinsics one might use the standard prototype, by using intrinsics or inline assembly for GCC ^{[32]} :
#ifdef USE_X86INTRINSICS #include <intrin.h> #pragma intrinsic(_BitScanForward64) #pragma intrinsic(_BitScanReverse64) /** * bitScanForward * @param bb bitboard to scan * @precondition bb != 0 * @return index (0..63) of least significant one bit */ int bitScanForward(U64 x) { unsigned long index; assert (x != 0); _BitScanForward64(&index, x); return (int) index; } /** * bitScanReverse * @param bb bitboard to scan * @precondition bb != 0 * @return index (0..63) of most significant one bit */ int bitScanReverse(U64 x) { unsigned long index; assert (x != 0); _BitScanReverse64(&index, x); return (int) index; } #else /** * bitScanForward * @param bb bitboard to scan * @precondition bb != 0 * @return index (0..63) of least significant one bit */ int bitScanForward(U64 x) { assert (x != 0); asm ("bsfq %0, %0" : "=r" (x) : "0" (x)); return (int) x; } /** * bitScanReverse * @param bb bitboard to scan * @precondition bb != 0 * @return index (0..63) of most significant one bit */ int bitScanReverse(U64 x) { assert (x != 0); asm ("bsrl %0, %0" : "=r" (x) : "0" (x)); return (int) x; } #endif
Bsf/Bsr x8664 Timings
The instruction latency and reciprocal throughput ^{[33]} heavily differs between various x8664 architectures:
Architecture Stepping  Instruction(s)  Latency / Cycles  Reciprocal Throughput 

AMD  
K8 ^{[34]}  BSF reg16/32/64, mreg16/32/64  Vector Path 8/8/9  8/8/9 
BSR reg16/32/64, mreg16/32/64  Vector Path 11  11  
K10 ^{[35]}  BSF reg, reg  Vector Path 4  4 
BSR reg, reg  Vector Path 4  4  
LZCNT reg, reg  Direct Path single 2  1  
Intel ^{[36]}  
ATOM  BSF/BSR  16  15 
NetBurst 0F_3H  BSF/BSR  16  4 
NetBurst 0F_2H  BSF/BSR  8  2 
Core 06_0EH  BSF/BSR  2  1 
65 nm Intel Core 06_0FH  BSF/BSR  2  1 
Enhanced Intel Core 06_17H  BSF/BSR  1  1 
Enhanced Intel Core 06_1DH  BSF/BSR  1  1 
Nehalem 06_1AH  BSF/BSR  3  1 
Sandy Bridge 
BSF/BSR  3  1 
LZCNT  3  1  
TZCNT  3  1 
Bsf/Bsr behavior with zero source
Intel and AMD specify different behavior. In praxis there seems no difference so far. However, as long as Intel docs explicitly state content undefined, it is recommend to don't rely on a preinitialized content of that target register, if the source is zero.
 Intel : If the content of the source operand is 0, the content of the destination operand is undefined. ^{[37]}
 AMD: If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register. ^{[38]}
ARM
ARM has CLZ (Count Leading Zeros) instruction for 32bit integers. ARM instruction is available in ARMv5 and above, 32bit Thumb instruction is available in ARMv6T2 and ARMv7 ^{[39]} , the Cintrinsic is called _builtin_clz ^{[40]} ^{[41]} ^{[42]} .
Engine Samples
 BitScan in Amundsen
 BitScan in Chess 0.5
 BitScan in CookieCat
 BitScan in Crafty (23.5)
 BitScan in Gibbon
 BitScan in Gk
 BitScan in HeavyChess
 BitScan in Kurt
 BitScan in Murka
 BitScan in Prophet
 BitScan in RedQueen
 BitScan in Spector
See also
 Bitboard Serialization
 BITSCAN, a C++ library for bitstrings by Pablo San Segundo
 BitTwiddling
 De Bruijn Sequence Generator
 JavaBitscan
 Population Count
Publications
 Alan Turing (1949). Alan Turing's Manual for the Ferranti Mk. I. transcribed in 2000 by Robert Thau, pdf from The Computer History Museum, 9.4 The position of the most significant digit » Ferranti Mark 1
 Charles E. Leiserson, Harald Prokop and Keith H. Randall (1998). Using de Bruijn Sequences to Index a 1 in a Computer Word, pdf
 Pablo San Segundo, Ramón Galán (2005). Bitboards: A New Approach. AIA 2005
 Donald Knuth (2009). The Art of Computer Programming, Volume 4, Fascicle 1: Bitwise tricks & techniques, as PreFascicle 1a postscript, p. 10
 Andreas Stiller (2013). Spezialkommando  Bits setzen, abfragen, scannen und mehr. c't Magazin für Computertechnik 7/2013, p. 186 (German)
Forum Posts
1996 ...
 Bitboards: speeding up FirstOne by Laurent Desnogues, rgcc, April 10, 1996 » Othello
 bitboard 2^i mod 67 is unique by Stefan Plenkner, rgcc, August 6, 1996
 bitboard 2^i mod 67 is unique by Stefan Plenkner, rgcc, August 7, 1996
 bitboard 2^i mod 67 is unique by Joël Rivat, rgcc, September 2, 1996
 Question to Bob: Crafty , Alpha and FindBit() by Guido Schimmels, CCC, June 05, 1998 » Crafty, DEC Alpha
 To Nalimov and other programmers about BSF/BSR in VC by Dezhi Zhao, CCC, January 16, 1999
2000 ...
 Re: TASM 5.0 versus BSF by Frans Morsch, comp.lang.asm.x86, March 28, 2000
 Will the Itanium have a BSF or BSR instruction? by Larry Griffiths, CCC, August 15, 2000
 Re: Will the Itanium have a BSF or BSR instruction? by Eugene Nalimov, CCC, August 16, 2000
 Binary question by Severi Salminen, CCC, October 19, 2000
 Bitboards and Piece Lists by Dann Corbit, CCC, June 14, 2001
 FirstBit() in assembler by David Rasmussen, CCC, January 13, 2002
 Reply from Intel about BSF/BSR by Severi Salminen, CCC, January 31, 2002
 "Using de Bruijn Sequences to Index a 1 in a Computer Word" by Oliver Roese, CCC, February 08, 2002
 Extracting bits from a BitBoard... by Joel Veness, CCC, November 17, 2002
 Another hacky method for bitboard bit extraction by Walter Faxon, CCC, November 17, 2002
 Modulo verus BitScan and MMXPopCount by Gerd Isenberg, CCC, November 29, 2002
 Fast 3DNow! BitScan by Gerd Isenberg, CCC, December 01, 2002
 Bitscan Conclusions by Matt Taylor, CCC, January 05, 2003
 Bitscan by Matt Taylor, CCC, February 11, 2003
 FirstOne for Linux by Sune Fischer, CCC, March 29, 2003
 Bit magic by Matt Taylor, comp.lang.asm.x86, June 26, 2003
 Re: Bit magic by Matt Taylor, comp.lang.asm.x86, June 29, 2003
 Re: De Bruijn Sequence Generator by Dieter Bürßner, CCC, December 30, 2003 » De Bruijn Sequence Generator
 Determining location of LSB/MSB by Renze Steenhuisen, CCC, February 09, 2004
 Nalimov: bsf/bsr intrinsics implementation still not optimal by Dezhi Zhao, CCC, September 22, 2004
 Re: Nalimov: bsf/bsr intrinsics implementation still not optimal by Eugene Nalimov, CCC, September 23, 2004
2005 ...
 A data point for PowerPC bitboard program authors by Steven Edwards, CCC, May 09, 2005 » PowerPC
 Best BitBoard LSB funktion? by Reinhard Scharnagl, Winboard Programming Forum, July 20, 2005
 Fastest bitboard compress routine when you can't use ASM by mambofish, CCC, May 31, 2007
 Bit twiddling question, part 2: arbitrary bitscan order by Zach Wegner, CCC, August 11, 2009
 32 bit versions for bitscan64 by Michael Hoffmann, CCC, August 21, 2009
 64bit intrinsic performance by Nathan Thom, CCC, October 27, 2009
 Bit Scan (equivalent to ASM instructions bsr and bsf) by Pascal Georges, CCC, December 24, 2009
2010 ...
 bitScanReverse32 by Luca Hemmerich, CCC, January 25, 2010
 Introduction and (hopefully) contribution  bitboard methods by Alcides Schulz, CCC, June 03, 2011 » Population Count
 First One by José Carlos, CCC, May 24, 2012
 Leading Zero Count Question by Matthew R. Brades, CCC, September 16, 2012
 Optimizing bitboards for ARM by Martin Sedlak, CCC, November 17, 2012
 Symmetric move generation using bitboards by Lasse Hansen, CCC, December 20, 2014
 Stockfish 32bit and hardware instructions on MSVC++ by Syed Fahad, CCC, December 30, 2014 » Stockfish, BitScan, Population Count
2015 ...
 Fun with De Bruijn by Henk van den Belt, CCC, August 27, 2015
 Re: Linux Version of Maverick 1.5 by Michael Dvorkin, CCC, November 12, 2015 » OS X, Maverick
 syzygy users (and Ronald) by Robert Hyatt, CCC, September 29, 2016 » Population Count
 CPW bitscan with reset could someone explain this line? by Mahmoud Uthman, CCC, March 14, 2019
2020 ...
 Looking for intrinsic "least significant bit" on Visual Studio by Oliver Brausch, CCC, September 03, 2020 » Trailing Zero Count
 C++20 standard bit operations by Jon Dart, CCC, November 15, 2020 » General Setwise Operations, Population Count, C++
 Optimizing Matt Taylor's Folding trick by Daniel Infuehr, CCC, April 21, 2022
External Links
 Find first set from Wikipedia
 The Aggregate Magic Algorithms by Hank Dietz
 Bit Twiddling Hacks by Sean Eron Anderson
 An Efficient BitReversal Sorting Algorithm for the Fast Fourier Transform by Jennifer Elaan, January 16, 2005
 Efficient bit scan mechanism  United States Patent 6172623 from FreePatentsOnline.com
 Frank Zappa & the Mothers  King Kong BBC Studio Recording 1968, YouTube Video
References
 ↑ Picture gallery "Back in Holland 1941  1954" from The Official M.C. Escher Website
 ↑ Chip Architect: Detailed Architecture of AMD's Opteron  1.3 A third class of Instructions by Hans de Vries
 ↑ Looking for intrinsic "least significant bit" on Visual Studio by Oliver Brausch, CCC, September 03, 2020
 ↑ Donald Knuth (2009). The Art of Computer Programming, Volume 4, Fascicle 1: Bitwise tricks & techniques, as PreFascicle 1a postscript, p 10
 ↑ Charles E. Leiserson, Harald Prokop and Keith H. Randall (1998). Using de Bruijn Sequences to Index a 1 in a Computer Word, pdf
 ↑ "Using de Bruijn Sequences to Index a 1 in a Computer Word" discussion in CCC, February 08, 2002
 ↑ N. G. de Bruijn (1975). Acknowledgement of priority to C. Flye SainteMarie on the counting of circular arrangements of 2n zeros and ones that show each nletter word exactly once. Technical Report, Technische Hogeschool Eindhoven, available as pdf reprint
 ↑ Bit magic by Matt Taylor, comp.lang.asm.x86, June 26, 2003
 ↑ Re: Bit magic by Matt Taylor, comp.lang.asm.x86, June 29, 2003
 ↑ Another hacky method for bitboard bit extraction by Walter Faxon, CCC, November 17, 2002
 ↑ bitboard 2^i mod 67 is unique by Stefan Plenkner, rgcc, August 6, 1996
 ↑ Pablo San Segundo, Ramón Galán (2005). Bitboards: A New Approach. AIA 2005
 ↑ Best BitBoard LSB funktion? by Reinhard Scharnagl, Winboard Programming Forum, July 20, 2005
 ↑ Re: Will the Itanium have a BSF or BSR instruction? by Eugene Nalimov, CCC, August 16, 2000
 ↑ ms1bTable array in Eugene Nalimovs bitScanReverse by Stef Luijten, CCC, April 17, 2011
 ↑ just another reverse bitscan by Gerd Isenberg, CCC, December 22, 2005
 ↑ final version  homage to FZ by Gerd Isenberg, CCC, December 23, 2005
 ↑ EuroPython 2012: Florence, July 2–8  Mark Dickinson
 ↑ Find the log base 2 of an Nbit integer in O(lg(N)) operations with multiply and lookup from Bit Twiddling Hacks by Sean Eron Anderson
 ↑ 68020 Bit Field Instructions
 ↑ SSE4a from Wikipedia
 ↑ __lzcnt16, __lzcnt, __lzcnt64 Visual C++ Language Reference
 ↑ Bitscan by Matt Taylor, CCC, February 11, 2003
 ↑ CPW bitscan with reset could someone explain this line? by Mahmoud Uthman, CCC, March 14, 2019
 ↑ _BitScanForward, _BitScanForward64 Visual C++ Language Reference
 ↑ _BitScanReverse, _BitScanReverse64 Visual C++ Language Reference
 ↑ __lzcnt16, __lzcnt, __lzcnt64 Visual C++ Language Reference
 ↑ Section 3: library functions  Linux man pages
 ↑ ffsll(3): find first bit set in word  Linux man page
 ↑ Other Builtins  Using the GNU Compiler Collection (GCC)
 ↑ Re: Nalimov: bsf/bsr intrinsics implementation still not optimal by Eugene Nalimov, CCC, September 23, 2004
 ↑ Matters Computational  ideas, algorithms, source code (pdf) Ideas and Source Code by Jörg Arndt
 ↑ Instruction tables, Lists of instruction latencies, throughputs and microoperation breakdowns for Intel and AMD CPU's (pdf) by Agner Fog
 ↑ Software Optimization Guide for AMD64 Processors
 ↑ Software Optimization Guide for AMD Family 10h and 12h Processors
 ↑ Intel 64 and IA32 Architectures Optimization Reference Manual
 ↑ Intel® 64 and IA32 Architectures Software Developer’s Manual Volume 2A: Instruction Set Reference, AM (pdf) BSF—Bit Scan Forward 387
 ↑ AMD64 Architecture Programmer’s Manual Volume 3: GeneralPurpose and System Instructions (pdf) Bit Scan Forward pg. 111
 ↑ ARM Information Center > General data processing instructions > CLZ
 ↑ ARM Information Center > Instruction intrinsics > __builtin_clz
 ↑ Other Builtins  Using the GNU Compiler Collection (GCC)
 ↑ Bit Scan (equivalent to ASM instructions bsr and bsf) by Pascal Georges, CCC, December 24, 2009