'''[[Main Page|Home]] * [[Programming]] * [[Algorithms]] * Parallel Prefix Algorithms'''

[[FILE:parallelprefixsum.png|border|right|thumb|link=http://www.toves.org/books/distalg/|Parallel Prefix Sum <ref>[http://ozark.hendrix.edu/~burch/ Carl Burch] ('''2009'''). ''[http://www.toves.org/books/distalg/ Introduction to parallel & distributed algorithms]''. On-line Book</ref> ]]

'''Parallel prefix algorithms''' compute all [https://en.wikipedia.org/wiki/Prefix_sum prefixes] of an input sequence in [https://en.wikipedia.org/wiki/Time_complexity#Logarithmic_time logarithmic time], and are the topic of various [[SIMD and SWAR Techniques|SIMD and SWAR techniques]] applied to [[Bitboards|bitboards]] <ref>[[#Seealso|See also]]</ref>. This page provides some basics on simple parallel prefix problems, like parity words and [https://en.wikipedia.org/wiki/Gray_code Gray code] with some interesting properties, followed by some theoretical background on more complex parallel prefix problems and the [[Kogge-Stone Algorithm|Kogge-Stone algorithm]] as elaborated by [[Steffan Westcott]], and a comparison of three similar Kogge-Stone routines for different purposes.
<span id="ParityWords"></span>
=Parity Words=
[https://en.wikipedia.org/wiki/Parity_bit#Parity Parity] is often related to [https://en.wikipedia.org/wiki/Error_detection_and_correction error detection and correction]. The modulo 2 sum, aka [[General Setwise Operations#ExclusiveOr|exclusive or]] (xor), of any number of inputs is sufficient to determine even or odd parity. On a bitboard, for each bit 0..63 we would like to know whether the number of ones, including all less significant (trailing) bits, is odd or even, where xi is the bit in x with bit-index i:
<pre>
p0 = x0;
p1 = x0 ^ x1;
p2 = x0 ^ x1 ^ x2;
...
p63 = x0 ^ x1 ^ x2 ^ .... ^ x63;
</pre>
Or [[Recursion|recursively]]
<pre>
p[i] = p[i-1] ^ x[i];
</pre>

p63 indicates whether the cardinality of the whole set is odd or even. The parallel prefix solution looks like this:
<pre>
x ^= x << 1;
x ^= x << 2;
x ^= x << 4;
x ^= x << 8;
x ^= x << 16;
x ^= x << 32;
</pre>
and needs only <code>log2(64) == 6</code> shift and xor steps instead of 63 sequential xors.

Surprisingly, thanks to xor, we can restore the initial value with one final step:
<pre>
x ^= x << 1;
</pre>
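Wrapped into small routines, this looks as follows (a minimal sketch, assuming the unsigned 64-bit type U64 used elsewhere on this page):
<pre>
U64 parityPrefix(U64 x) {        // bit i of the result holds the parity of bits 0..i of x
   x ^= x <<  1;
   x ^= x <<  2;
   x ^= x <<  4;
   x ^= x <<  8;
   x ^= x << 16;
   x ^= x << 32;
   return x;
}

U64 parityPrefixInverse(U64 p) { // one further xor-shift restores the original value
   return p ^ (p << 1);
}
</pre>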
<span id="GrayCode"></span>
=Gray Code=
The reflected binary code, or [https://en.wikipedia.org/wiki/Gray_code Gray code], has similar properties <ref>[http://aggregate.org/MAGIC/#Gray%20Code%20Conversion Gray Code Conversion] from [http://aggregate.org/MAGIC The Aggregate Magic Algorithms] by [[Hank Dietz]]</ref>. The Gray code is defined by the modulo 2 sum or [[General Setwise Operations#ExclusiveOr|exclusive or]] of a word with its value shifted right by one.
<pre>
gray = x ^ (x >> 1);
</pre>
This ensures that consecutive Gray codes differ in only one bit, that is, they have a [[Population Count#HammingDistance|Hamming distance]] of one. For instance, the 4-bit or [[Nibble|nibble]] Gray code:
{| class="wikitable"
|-
! x decimal
! x binary
! x >> 1
! Gray binary
! Gray decimal
|-
| style="text-align:right;" | 0
| style="text-align:center;" | 0000
| style="text-align:center;" | 0000
| style="text-align:center;" | 0000
| style="text-align:right;" | 0
|-
| style="text-align:right;" | 1
| style="text-align:center;" | 0001
| style="text-align:center;" | 0000
| style="text-align:center;" | 0001
| style="text-align:right;" | 1
|-
| style="text-align:right;" | 2
| style="text-align:center;" | 0010
| style="text-align:center;" | 0001
| style="text-align:center;" | 0011
| style="text-align:right;" | 3
|-
| style="text-align:right;" | 3
| style="text-align:center;" | 0011
| style="text-align:center;" | 0001
| style="text-align:center;" | 0010
| style="text-align:right;" | 2
|-
| style="text-align:right;" | 4
| style="text-align:center;" | 0100
| style="text-align:center;" | 0010
| style="text-align:center;" | 0110
| style="text-align:right;" | 6
|-
| style="text-align:right;" | 5
| style="text-align:center;" | 0101
| style="text-align:center;" | 0010
| style="text-align:center;" | 0111
| style="text-align:right;" | 7
|-
| style="text-align:right;" | 6
| style="text-align:center;" | 0110
| style="text-align:center;" | 0011
| style="text-align:center;" | 0101
| style="text-align:right;" | 5
|-
| style="text-align:right;" | 7
| style="text-align:center;" | 0111
| style="text-align:center;" | 0011
| style="text-align:center;" | 0100
| style="text-align:right;" | 4
|-
| style="text-align:right;" | 8
| style="text-align:center;" | 1000
| style="text-align:center;" | 0100
| style="text-align:center;" | 1100
| style="text-align:right;" | 12
|-
| style="text-align:right;" | 9
| style="text-align:center;" | 1001
| style="text-align:center;" | 0100
| style="text-align:center;" | 1101
| style="text-align:right;" | 13
|-
| style="text-align:right;" | 10
| style="text-align:center;" | 1010
| style="text-align:center;" | 0101
| style="text-align:center;" | 1111
| style="text-align:right;" | 15
|-
| style="text-align:right;" | 11
| style="text-align:center;" | 1011
| style="text-align:center;" | 0101
| style="text-align:center;" | 1110
| style="text-align:right;" | 14
|-
| style="text-align:right;" | 12
| style="text-align:center;" | 1100
| style="text-align:center;" | 0110
| style="text-align:center;" | 1010
| style="text-align:right;" | 10
|-
| style="text-align:right;" | 13
| style="text-align:center;" | 1101
| style="text-align:center;" | 0110
| style="text-align:center;" | 1011
| style="text-align:right;" | 11
|-
| style="text-align:right;" | 14
| style="text-align:center;" | 1110
| style="text-align:center;" | 0111
| style="text-align:center;" | 1001
| style="text-align:right;" | 9
|-
| style="text-align:right;" | 15
| style="text-align:center;" | 1111
| style="text-align:center;" | 0111
| style="text-align:center;" | 1000
| style="text-align:right;" | 8
|-
| style="text-align:right;" | 0
| style="text-align:center;" | 0000
| style="text-align:center;" | 0000
| style="text-align:center;" | 0000
| style="text-align:right;" | 0
|}

Decoding a Gray code is a parallel prefix problem similar to the [[#ParityWords|parity words]], but with the xor prefix running from the most significant bit downwards. For a nibble Gray code:
<pre>
x = gray;
x ^= x >> 2;
x ^= x >> 1;
</pre>
Or for 64-bit Gray codes:
<pre>
x = gray;
x ^= x >> 32;
x ^= x >> 16;
x ^= x >> 8;
x ^= x >> 4;
x ^= x >> 2;
x ^= x >> 1;
</pre>
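Wrapped into small routines (a minimal sketch using the U64 type of this page), encoding and decoding are inverse operations:
<pre>
U64 toGray(U64 x) {      // reflected binary code of x
   return x ^ (x >> 1);
}

U64 fromGray(U64 gray) { // parallel prefix xor from the most significant bit downwards
   U64 x = gray;
   x ^= x >> 32;
   x ^= x >> 16;
   x ^= x >>  8;
   x ^= x >>  4;
   x ^= x >>  2;
   x ^= x >>  1;
   return x;             // fromGray(toGray(x)) == x for all x
}
</pre>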
=Fill Stuff=
Parallel prefix [[Pawn Fills|front- and rear-fills]], for instance to determine [[Pawn Spans|pawn spans]] in [[Bitboards|bitboards]], are quite simple to understand.
<pre>
U64 nortFill(U64 gen) {
   gen |= (gen <<  8);
   gen |= (gen << 16);
   gen |= (gen << 32);
   return gen;
}

U64 soutFill(U64 gen) {
   gen |= (gen >>  8);
   gen |= (gen >> 16);
   gen |= (gen >> 32);
   return gen;
}
</pre>
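For instance, pawn front and rear spans may be derived from the fills by one further shift, as in the following sketch (assuming white pawns and the little-endian rank-file mapping used elsewhere on this page):
<pre>
U64 wFrontSpans(U64 wpawns) { return nortFill(wpawns) << 8; } // squares in front of white pawns
U64 wRearSpans (U64 wpawns) { return soutFill(wpawns) >> 8; } // squares behind white pawns
</pre>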
nortFill and soutFill are actually simplified [[Kogge-Stone Algorithm|Kogge-Stone algorithms]], whose general form considers a second bitboard to occlude the flood accordingly.

<span id="FurtherElaborationsOnKoggeStone"></span>
=Elaborations on Kogge-Stone=
Elaborations on the [[Kogge-Stone Algorithm|Kogge-Stone]] parallel prefix algorithm by [[Steffan Westcott]] <ref>[https://www.stmintz.com/ccc/index.php?id=252289 flood fill attack bitboards] by [[Steffan Westcott]] from [[Computer Chess Forums|CCC]], September 15, 2002</ref>:

A '''parallel prefix''' problem is of the sort:
<pre>
Given the terms x1, x2, x3, ... , xN and an associative operator #,
find the values
   q1 = x1
   q2 = x1 # x2
   q3 = x1 # x2 # x3
   .
   .
   .
   qN = x1 # x2 # x3 # ... # xN
</pre>
Associative expressions can be bracketed any way you like; the result is the same. To see why this is interesting for chess programming, let's define x1, x2, ... and # to be
<pre>
xI = < gI, pI > (a two element tuple)
where gI = square aI has a rook on it
and pI = square aI is empty
for all I = 1, 2, 3, ... , 8

and xI # xJ = < gI, pI > # < gJ, pJ >
= < ((gI && pJ) || gJ), (pI && pJ) >
</pre>
All this algebra looks very scary at first, so here's an example:
<pre>
q2 = x1 # x2 = < rook_on_a1, a1_is_empty > # < rook_on_a2, a2_is_empty >
= < ((rook_on_a1 && a2_is_empty) || rook_on_a2),
(a1_is_empty && a2_is_empty) >
</pre>
* The first element of the tuple q2 tells us whether a rook is on a2 or can move up to a2 along empty squares.
* The second element of the tuple q2 tells us whether a rook could move up freely through a1-a2, i.e. both a1 and a2 are empty.

In general,
<pre>
xI # x(I+1) # ... # xJ =
< a_rook_somewhere_on_aI_to_aJ_is_either_on_aJ_or_can_move_up_to_aJ,
all_squares_aI_to_aJ_are_empty >
</pre>
and so
<pre>
qJ = < a_rook_somewhere_in_file_a_is_either_on_aJ_or_can_move_up_to_aJ,
all_squares_a1_to_aJ_are_empty >
</pre>
The tuple elements g and p are known as the "generator" and "propagator" terms in the literature on fast carry propagation circuits. But we are not dealing with carry bits in a carry propagate adder! Here, we are generating and propagating upward rook attacks along a file of a chessboard.
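In bitboard terms, the tuple and its # operator might be rendered as in the following sketch (illustrative only, not part of the original post; the struct and function names are made up, and the per-square boolean connectives become bitwise operations):
<pre>
typedef struct { U64 g; U64 p; } GP; // <generator, propagator> tuple

// the associative operator #: <gI, pI> # <gJ, pJ> = < (gI & pJ) | gJ, pI & pJ >
GP gpOp(GP i, GP j) {
   GP r;
   r.g = (i.g & j.p) | j.g;
   r.p =  i.p & j.p;
   return r;
}
</pre>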

Why all this theory? Well, prefix computation is a heavily researched area, researched by many folk smarter than me :) It's of interest because it has many applications, such as VLSI design, pattern matching, and others. There are many different ways of going about it, with different implementation characteristics.
<pre>
U64 FillUpOccluded(U64 g, U64 p) {
   g |= p & (g <<  8);
   p &=     (p <<  8);
   g |= p & (g << 16);
   p &=     (p << 16);
   return g |= p & (g << 32);
}

U64 RookAttacksUp(U64 rooks, U64 empty_squares) {
   return ShiftUp(FillUpOccluded(rooks, empty_squares));
}
</pre>
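ShiftUp is not given in the post; under the little-endian rank-file mapping used elsewhere on this page, a minimal definition might be:
<pre>
U64 ShiftUp(U64 x) { return x << 8; } // move every bit one rank up the board
</pre>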
The method chosen in FillUpOccluded() is based on a Kogge-Stone parallel prefix network, because it can be implemented very easily in software. The diagram below (trust me, it really _is_ supposed to look like that) is an illustration of how it works. The corresponding lines of program code are shown on the right. The inputs are fed into the network at the top, pass along the connecting lines, are combined by the # operator at various points, and the outputs appear at the bottom.
<pre>
x1 x2 x3 x4 x5 x6 x7 x8 Input : g, p
| | | | | | | |
V V V V V V V V
| | | | | | | |
| | | | | | | |
|\ |\ |\ |\ |\ |\ |\ |
| \| \| \| \| \| \| \|
| # # # # # # # g |= p & (g << 8);
| | | | | | | | p &= (p << 8);
|\ |\ |\ |\ |\ |\ | |
| \: \: \: \: \: \: |
| \ \ \ \ \ \ |
| :\ :\ :\ :\ :\ :\ |
| | \| \| \| \| \| \|
| | # # # # # # g |= p & (g << 16);
| | | | | | | | p &= (p << 16);
|\ |\ |\ |\ | | | |
| \: \: \: \: | | |
| \ \ \ \ | | |
| :\ :\ :\ :\ | | |
| | \: \: \: \: | |
| | \ \ \ \ | |
| | :\ :\ :\ :\ | |
| | | \: \: \: \: |
| | | \ \ \ \ |
| | | ;\ ;\ :\ :\ |
| | | | \| \| \| \|
| | | | # # # # g |= p & (g << 32);
| | | | | | | |
| | | | | | | |
V V V V V V V V
q1 q2 q3 q4 q5 q6 q7 q8 Output : g
</pre>
To convince yourself this works, select any output qI and trace upwards from the bottom of the diagram. You'll see it leads back to x1, x2, ... , xI, so qI is formed by some computation of x1 # x2 # ... # xI.

[Implementor's note: The two program lines that assign to variable p are not strictly correct. By the end of the routine, p1 and p2 have been trashed (reset). However, they are trashed only after they can no longer affect the correctness of the routine's result.]

<span id="KoggeStoneAdder"></span>
=Add/Sub versus Attacks=
As a comparison, here are three Kogge-Stone routines: a "classical" byte-wise [[SIMD and SWAR Techniques#SWAR|SWAR]] adder, byte-wise subtraction, and byte-wise rook attacks in the east direction. Note that the inner parallel prefix instructions are the same in all three routines:
<pre>
gen |= pro & (gen << 1);
pro &= (pro << 1);
gen |= pro & (gen << 2);
pro &= (pro << 2);
gen |= pro & (gen << 4);
</pre>
The differences are the assignments of generator and propagator, and the computation of the final result.
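The routines below rely on file mask constants, assumed to be defined as follows under the little-endian rank-file mapping used elsewhere on this page:
<pre>
const U64 aFile = 0x0101010101010101ULL; // file A, the least significant bit of each byte (rank)
const U64 hFile = 0x8080808080808080ULL; // file H, the most significant bit of each byte (rank)
</pre>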

==Add==
[[FILE:300px-4_bit_Kogge_Stone_Adder_Example_new.png|none|border|text-bottom|link=https://en.wikipedia.org/wiki/Kogge-Stone_adder|4-bit Kogge-Stone adder]]
The software approach of a [https://en.wikipedia.org/wiki/Kogge-Stone_adder Kogge-Stone] [[Combinatorial Logic#Adder|hardware adder]] <ref>[http://www.aoki.ecei.tohoku.ac.jp/arith/mg/algorithm.html#fsa_pfx Hardware algorithms for arithmetic modules] from the ARITH research group, Aoki lab., [https://en.wikipedia.org/wiki/Tohoku_University Tohoku University]</ref> requires the initial carry bits as generator, i.e. the [[General Setwise Operations#Intersection|intersection]] of both summands, while the propagator is the modulo 2 sum, the [[General Setwise Operations#ExclusiveOr|exclusive or]] of both summands. To avoid inter-byte overflows, the propagator has the least significant bit of each byte cleared. However, the final carries need one further masked shift, and the sum is their exclusive or with both summands, i.e. with the initial propagator.
<pre>
// SIMD bytewise add a + b
U64 koggeStoneByteAdd(U64 a, U64 b) {
   U64 gen, pro;
   gen  = a & b;               // carry generate
   pro  = a ^ b;               // carry propagate
   pro &= ~aFile;              // cut carry chains at byte boundaries
   gen |= pro & (gen << 1);
   pro &=       (pro << 1);
   gen |= pro & (gen << 2);
   pro &=       (pro << 2);
   gen |= pro & (gen << 4);
   gen  = ~aFile & (gen << 1); // carries into each bit, no inter-byte carry
   return gen ^ a ^ b;         // sum = carry ^ a ^ b
}
</pre>
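A possible use of the byte-wise adder (an illustrative example, not from the original page) is adding one to each of the eight byte lanes at once, e.g. to maintain packed per-rank counters:
<pre>
// increment each of the eight byte lanes by one, without carries across byte boundaries
U64 incrementByteLanes(U64 counters) {
   return koggeStoneByteAdd(counters, aFile); // aFile has exactly one bit set per byte
}
</pre>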
==Sub==
Subtraction is based on the [[General Setwise Operations#TheTwosComplement|two's complement]], which is the [[General Setwise Operations#ComplementSet|ones' complement]] plus one, per byte the least significant bit, i.e. the A-file, assuming [[Square Mapping Considerations#LittleEndianRankFileMapping|little-endian rank-file mapping]]. Thus, a - b is actually the sum of {a, ~b, one}, which has to be considered in calculating propagator and generator; the generator is the [[General Setwise Operations#Majority|majority]] of all three inputs. Note that
<pre>
gen =~aFile & (gen << 1);
return gen ^ a ^ ~b ^ aFile;
</pre>
can be rewritten as in the following routine, because xoring with aFile after masking with ~aFile is the same as simply oring with aFile:
<pre>
// SIMD bytewise sub a - b
U64 koggeStoneByteSub(U64 a, U64 b) {
   U64 gen, pro;
   gen  = a & ~b;              // based on -b = ~b + 1
   pro  = a ^ ~b;
   gen |= aFile & pro;         // majority of {a, ~b, one}
   gen |= pro & (gen << 1);
   pro &=       (pro << 1);
   gen |= pro & (gen << 2);
   pro &=       (pro << 2);
   gen |= pro & (gen << 4);
   gen  = aFile | (gen << 1);
   return gen ^ a ^ ~b;
}
</pre>
==East Attacks==
Here the Kogge-Stone algorithm generates the sliding piece attacks along the empty squares. The generator is the set of rooks or queens, the propagator the set of empty squares excluding the A-file to avoid wraps.
<pre>
U64 eastAttacks(U64 occ, U64 rooks) {
   U64 gen, pro;
   pro  = ~occ & ~aFile;       // empty squares, A-file excluded to avoid wraps
   gen  = rooks;
   gen |= pro & (gen << 1);
   pro &=       (pro << 1);
   gen |= pro & (gen << 2);
   pro &=       (pro << 2);
   gen |= pro & (gen << 4);
   gen  = ~aFile & (gen << 1); // shift the occluded fill one step east
   return gen;
}
</pre>
The mentioned routines also demonstrate that this east attacking direction may be implemented more cheaply with [[Fill by Subtraction]], based on [[SIMD and SWAR Techniques#SWAR|SWAR-wise techniques]] and [[Subtracting a Rook from a blocking Piece|subtracting a rook from a blocking piece]]:
<pre>
U64 byteAdd(U64 a, U64 b) {return ((a & ~hFile) + (b & ~hFile)) ^ ((a ^ b) & hFile);}
U64 byteSub(U64 a, U64 b) {return ((a | hFile) - (b & ~hFile)) ^ ((a ^ ~b) & hFile);}
</pre>
However, with the appropriate left and right shifts and wrap masks, Kogge-Stone can deal with all seven other sliding directions as well.
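For instance, a mirrored sketch for west attacks (illustrative, following the same scheme as eastAttacks above, but excluding the H-file to avoid wraps of the right shifts):
<pre>
U64 westAttacks(U64 occ, U64 rooks) {
   U64 gen, pro;
   pro  = ~occ & ~hFile;       // empty squares, H-file excluded to avoid wraps
   gen  = rooks;
   gen |= pro & (gen >> 1);
   pro &=       (pro >> 1);
   gen |= pro & (gen >> 2);
   pro &=       (pro >> 2);
   gen |= pro & (gen >> 4);
   return ~hFile & (gen >> 1); // shift the occluded fill one step west
}
</pre>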

<span id="Seealso"></span>
=See also=
* [[General Setwise Operations#parallelPrefixMSB|Parallel prefix MSB]]
* [[Population Count#SWARPopcount|SWAR population count]] versus [[Population Count#TooSlow|the too slow loop approach]]
* [[Flipping Mirroring and Rotating#parallelPrefixFlip|Flipping and Mirroring]]
* [[Pawn Fills]]
* [[Occupancy of any Line]]
* [[Kogge-Stone Algorithm]]

=Publications=
* [http://www.cse.nd.edu/%7Ekogge/ Peter M. Kogge], [[Mathematician#HSStone|Harold S. Stone]] ('''1973'''). ''A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations''. IEEE Transactions on Computers, 1973, C-22, 783-791, [http://bwrc.eecs.berkeley.edu/Classes/ee225c/2000%20225c/Papers/KoggeStone.pdf pdf]
* [http://www.cs.washington.edu/homes/ladner/ Richard E. Ladner], [http://cs-www.cs.yale.edu/homes/fischer/ Michael J. Fischer] ('''1980'''). ''Parallel Prefix Computation''. [http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=5EBA05F3F708ABC617EE34DD08B65485?doi=10.1.1.106.6247&rep=rep1&type=pdf pdf] via [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.6247 CiteSeerX]
* [[Clyde Kruskal]], [http://necsi.edu/faculty/rudolph.html Larry Rudolph], [http://cs.illinois.edu/people/faculty/marc-snir Marc Snir] ('''1985'''). ''The power of parallel prefix''. [https://en.wikipedia.org/wiki/IEEE_Transactions_on_Computers IEEE Transactions on Computers], Vol. C-34
* [[Peter Sanders]], [http://www.traff-industries.de/ Jesper Larsson Träff] ('''2006'''). ''Parallel Prefix (Scan) Algorithms for MPI''. in EuroPVM/MPI 2006, LNCS, [http://algo2.iti.uni-karlsruhe.de/sanders/papers/scan.pdf pdf]
* [http://ozark.hendrix.edu/~burch/ Carl Burch] ('''2009'''). ''[http://www.toves.org/books/distalg/ Introduction to parallel & distributed algorithms]''. On-line Book

=Forum Posts=
* [https://www.stmintz.com/ccc/index.php?id=252289 flood fill attack bitboards] by [[Steffan Westcott]] from [[Computer Chess Forums|CCC]], September 15, 2002
* [http://www.talkchess.com/forum/viewtopic.php?start=0&t=25979&start=10 Re: Hyperbola Quiesscene: hardly any improvement] by [[Karlo Bala Jr.]], [[CCC]], January 14, 2009 » [[Hyperbola Quintessence]]

=External Links=
* [http://www.aoki.ecei.tohoku.ac.jp/arith/mg/algorithm.html#fsa_pfx Hardware algorithms for arithmetic modules] from the ARITH research group, Aoki lab., [https://en.wikipedia.org/wiki/Tohoku_University Tohoku University]
* [https://en.wikipedia.org/wiki/Prefix_sum Prefix sum from Wikipedia]
* [http://www.itl.nist.gov/div897/sqg/dads/HTML/parallelPrefix.html parallel prefix computation] from [http://www.nist.gov/index.html National Institute of Standards and Technology]
* [http://everything2.com/title/parallel+prefix+algorithm parallel prefix algorithm] from [http://everything2.com/ Everything2.com]
* [http://charm.cs.uiuc.edu/tutorial/ParallelPrefix.htm Charm++ Tutorial - Parallel Prefix Program]
* [[Videos#JoeZawinul|Joe Zawinul]], [[Videos#TrilokGurtu|Trilok Gurtu]] - Orient Express Part1, [https://en.wikipedia.org/wiki/YouTube YouTube] Video
: {{#evu:https://www.youtube.com/watch?v=922LumI2ilo|alignment=left|valignment=top}}

=References=
<references />

'''[[Algorithms|Up one Level]]'''
