Difference between revisions of "SIMD and SWAR Techniques"

From Chessprogramming wiki
(One intermediate revision by the same user not shown)
Line 61: Line 61:
  =See  also=
  * [[GPU]]
+ * [[NNUE]]
  * [[Parallel Prefix Algorithms]]

Line 108: Line 109:
  * [https://en.wikipedia.org/wiki/SWAR SWAR from Wikipedia]
  * [http://www.aggregate.org/SWAR/ The Aggregate: SWAR, SIMD Within A Register] by [[Hank Dietz]]
- ==[[x86]]/[[x64]]==
+ ==[[x86]]/[[x86-64]]==
  * [https://en.wikipedia.org/wiki/MMX_%28instruction_set%29 MMX from Wikipedia]
  * [https://en.wikipedia.org/wiki/3DNow 3DNow! from Wikipedia]

Revision as of 18:59, 18 November 2020


x86, x86-64, as well as PowerPC and Power ISA v.2.03 processors provide Single Instruction, Multiple Data (SIMD) operations on vectors of floats, doubles or various integers (bytes, words, double words or quad words), available through assembly and compiler intrinsics. SIMD applications related to computer chess cover bitboard computations and fill algorithms like Dumb7Fill and the Kogge-Stone Algorithm, as well as evaluation-related tasks, such as an SSE2 dot product of 64 bits by a vector of 64 bytes.
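To make the last example concrete, the dot product of 64 bits by a vector of 64 bytes can be sketched in plain scalar C. This is only a reference for what such a routine computes, not the SSE2 implementation itself; the names `dotProduct64Scalar`, `sumOfSetBitIndices` and `dotProductSelfCheck` are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t U64; /* assumed bitboard type */

/* Scalar reference (no SIMD): sum weights[i] over every set bit i of bb. */
int dotProduct64Scalar(U64 bb, const uint8_t weights[64]) {
    int sum = 0;
    for (int i = 0; i < 64; ++i)
        sum += (int)((bb >> i) & 1) * weights[i];
    return sum;
}

/* Usage sketch: with weights[i] = i the result is the sum of the
   indices of all set bits. */
int sumOfSetBitIndices(U64 bb) {
    uint8_t weights[64];
    for (int i = 0; i < 64; ++i)
        weights[i] = (uint8_t)i;
    return dotProduct64Scalar(bb, weights);
}

/* Hypothetical self-check of the two helpers above. */
void dotProductSelfCheck(void) {
    assert(sumOfSetBitIndices(0) == 0);
    assert(sumOfSetBitIndices(0x8000000000000001ULL) == 63); /* bits 0 and 63 */
}
```

A SIMD version would process many of the 64 byte weights per instruction instead of looping bit by bit; the scalar form is mainly useful as a test oracle for such a routine.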

SWAR, an acronym for SIMD Within A Register, was coined by Hank Dietz and Randall J. Fisher [2]. It is a processing model which applies SIMD parallel processing across sections of a CPU register. Often, vectors of entities smaller than a byte are processed in a parallel prefix manner.

SIMD Instruction Sets

SWAR Arithmetic

To apply addition and subtraction on vectors of bit-aggregates or bit-field structures within a general purpose register, one has to take care that carries and borrows do not propagate into neighboring fields. Hence the need to mask off the most significant bit of every field (H) and to add in two steps: one 'add' with the MSBs clear, and one add modulo 2, aka 'xor', for the MSBs themselves. For bytewise (rankwise) math inside a 64-bit register, H is 0x8080808080808080 and L is 0x0101010101010101.

SWAR add z = x + y
    z = ((x &~H) + (y &~H)) ^ ((x ^ y) & H)
SWAR sub z = x - y
    z = ((x | H) - (y &~H)) ^ ((x ^~y) & H)
SWAR average z = (x+y)/2 based on x + y = (x^y) + 2*(x&y)
    z = (x & y) + (((x ^ y) & ~L) >> 1)
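The three formulas can be written out as C functions for the bytewise case and compared against a byte-by-byte loop. This is a minimal sketch under stated assumptions: the names `swarAdd8`, `swarSub8`, `swarAvg8`, `bytewiseRef` and `swarSelfCheck` are hypothetical, and `U64` is assumed to be a 64-bit unsigned integer.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t U64; /* assumed 64-bit unsigned type */

/* Masks for eight 8-bit fields inside one 64-bit word. */
static const U64 H = 0x8080808080808080ULL; /* MSB of every byte */
static const U64 L = 0x0101010101010101ULL; /* LSB of every byte */

/* Bytewise x + y (mod 256 per byte); carries never cross a byte border. */
U64 swarAdd8(U64 x, U64 y) {
    return ((x & ~H) + (y & ~H)) ^ ((x ^ y) & H);
}

/* Bytewise x - y (mod 256 per byte); borrows never cross a byte border. */
U64 swarSub8(U64 x, U64 y) {
    return ((x | H) - (y & ~H)) ^ ((x ^ ~y) & H);
}

/* Bytewise floor((x + y) / 2) without intermediate overflow. */
U64 swarAvg8(U64 x, U64 y) {
    return (x & y) + (((x ^ y) & ~L) >> 1);
}

/* Scalar reference: op is '+', '-' or 'a' (average), applied per byte. */
U64 bytewiseRef(U64 x, U64 y, char op) {
    U64 z = 0;
    for (int i = 0; i < 64; i += 8) {
        unsigned a = (unsigned)((x >> i) & 0xFF);
        unsigned b = (unsigned)((y >> i) & 0xFF);
        unsigned r = op == '+' ? a + b : op == '-' ? a - b : (a + b) / 2;
        z |= (U64)(r & 0xFF) << i;
    }
    return z;
}

/* Hypothetical self-check: the SWAR forms must match the reference. */
void swarSelfCheck(void) {
    const U64 x = 0xFF017F80C3A500FEULL, y = 0x01FF01807E5AFF03ULL;
    assert(swarAdd8(x, y) == bytewiseRef(x, y, '+'));
    assert(swarSub8(x, y) == bytewiseRef(x, y, '-'));
    assert(swarAvg8(x, y) == bytewiseRef(x, y, 'a'));
}
```

In the average, clearing the L bits before the shift keeps the lowest bit of each byte from sliding into the top bit of its lower neighbor. Other field widths should work the same way with H and L adjusted to the MSB and LSB of each field.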

Samples

It is striking how similar these two SWAR and parallel prefix routines are. Mirror horizontally and population count both act on vectors of duos, nibbles and bytes: the first swaps bits, duos and nibbles, while the second adds their populations.

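/* U64 and C64 are not defined in this section; the two definitions
   below are assumed so that the samples compile on their own. */
typedef unsigned long long U64;
#define C64(c) c##ULL

/* Mirrors every byte (rank) by swapping bits, then duos, then nibbles. */
U64 mirrorHorizontal (U64 x) {
    const U64 k1 = C64(0x5555555555555555);
    const U64 k2 = C64(0x3333333333333333);
    const U64 k4 = C64(0x0f0f0f0f0f0f0f0f);
    x = ((x & k1) << 1) | ((x >> 1)  & k1); /* swap adjacent bits    */
    x = ((x & k2) << 2) | ((x >> 2)  & k2); /* swap adjacent duos    */
    x = ((x & k4) << 4) | ((x >> 4)  & k4); /* swap adjacent nibbles */
    return x;
}

/* Counts the set bits by adding populations of bits, duos and nibbles. */
int popCount (U64 x) {
    const U64 k1 = C64(0x5555555555555555);
    const U64 k2 = C64(0x3333333333333333);
    const U64 k4 = C64(0x0f0f0f0f0f0f0f0f);
    x =   x             - ((x >> 1)  & k1); /* each duo holds its bit count      */
    x =  (x & k2)       + ((x >> 2)  & k2); /* each nibble holds its bit count   */
    x = ( x             +  (x >> 4)) & k4 ; /* each byte holds its bit count     */
    x = (x * C64(0x0101010101010101))>> 56; /* sum of all bytes, taken from MSB  */
    return (int) x;
}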
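The first popCount step uses a subtraction where the later steps use the more obvious mask-and-add. The sketch below shows that both forms yield the same duo populations, and adds naive loop versions that may serve as test references for the two routines above. All function names here are hypothetical and `U64` is assumed to be a 64-bit unsigned integer.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t U64; /* assumed 64-bit unsigned type */

static const U64 K1 = 0x5555555555555555ULL; /* low bit of every duo */

/* Duo populations the obvious way: add the low and high bit of each duo. */
U64 duoCountsByAdd(U64 x) {
    return (x & K1) + ((x >> 1) & K1);
}

/* Same result with one mask less: a duo holds 2*hi + lo,
   so subtracting hi leaves hi + lo. */
U64 duoCountsBySub(U64 x) {
    return x - ((x >> 1) & K1);
}

/* Naive reference for population count: test one bit at a time. */
int popCountLoop(U64 x) {
    int n = 0;
    for (; x != 0; x >>= 1)
        n += (int)(x & 1);
    return n;
}

/* Naive reference for mirroring each byte: bit i moves to bit i ^ 7,
   which reverses the bit order inside its byte. */
U64 mirrorLoop(U64 x) {
    U64 z = 0;
    for (int i = 0; i < 64; ++i)
        if ((x >> i) & 1)
            z |= (U64)1 << (i ^ 7);
    return z;
}

/* Hypothetical self-check: both duo-count forms agree on a few inputs. */
void duoStepSelfCheck(void) {
    const U64 samples[] = { 0, 1, 0xFFULL, 0x123456789ABCDEF0ULL, ~(U64)0 };
    for (int i = 0; i < 5; ++i)
        assert(duoCountsByAdd(samples[i]) == duoCountsBySub(samples[i]));
}
```

Comparing `popCount` with `popCountLoop` and `mirrorHorizontal` with `mirrorLoop` over random inputs is a simple way to catch a mistyped mask or shift.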

See also

Publications

1987 ...

2000 ...

2010 ...

Manuals

AMD

NXP Semiconductors

Intel

Forum Posts

1999

2000 ...

2010 ...

2020 ...

External Links

x86/x86-64

Other

Misc

References
