SIMD and SWAR Techniques

x86 and x86-64 processors, as well as PowerPC and Power ISA v.2.03 processors, provide Single Instruction, Multiple Data (SIMD) instructions. These operate on vectors of floats, doubles, or integers of various widths (bytes, words, double words, or quad words) and are available through assembly and compiler intrinsics. SIMD applications in computer chess include bitboard computations and fill algorithms such as Dumb7Fill and the Kogge-Stone algorithm, as well as evaluation tasks such as an SSE2 dot product of 64 bits by a vector of 64 bytes.
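To make the last example concrete: a dot product of 64 bits by a vector of 64 bytes adds up the byte weight of every set bit of a bitboard. The following is a minimal scalar sketch of that computation, not the SSE2 routine itself. The names dotProduct64 and dotProduct64Demo, and the weights used in the demo, are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar reference: sum the byte weights of all set bits.
   A SIMD version would produce the same sum, handling many bytes per step. */
int dotProduct64(uint64_t bb, const uint8_t weights[64]) {
    int sum = 0;
    for (int sq = 0; sq < 64; ++sq) {
        if ((bb >> sq) & 1)
            sum += weights[sq];
    }
    return sum;
}

/* Usage sketch with made-up weights: square sq simply weighs sq. */
void dotProduct64Demo(void) {
    uint8_t weights[64];
    for (int sq = 0; sq < 64; ++sq)
        weights[sq] = (uint8_t)sq;
    assert(dotProduct64(0, weights) == 0);
    /* bits 0 and 63 set: 0 + 63 */
    assert(dotProduct64(UINT64_C(0x8000000000000001), weights) == 63);
}
```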

SWAR, an acronym for SIMD Within A Register, was coined by Hank Dietz and Randell J. Fisher ^[2]. It is a processing model that applies SIMD parallel processing across sections of a CPU register. Often, vectors of entities smaller than a byte are processed in a parallel prefix manner.

SIMD Instruction Sets

MMX on x86 and x86-64
SSE2, SSE3, SSSE3 and SSE4 on x86 and x86-64
SSE5 by AMD (proposed but not implemented, replaced by XOP ^[3])
AltiVec on PowerPC G4 and PowerPC G5, or VMX since POWER6
VSX since POWER7
Helium by ARM
NEON by ARM
SVE ^[4] and SVE2 ^[5] by ARM
AVX by Intel
AVX2 by Intel
AVX-512 by Intel
XOP by AMD
VIS ^[6] since SPARC v9
RISC-V vector-set extension ^[7]

SWAR Arithmetic

To apply addition and subtraction to vectors of bit-aggregates or bit-field structures within a general-purpose register, one has to ensure that carries and borrows do not spill from one field into the next. One therefore masks off the most significant bit of each field (H) and adds in two steps: one 'add' with the MSBs cleared, and one add modulo 2, aka 'xor', for the MSBs themselves. For bytewise (rankwise) math inside a 64-bit register, H is 0x8080808080808080 and L is 0x0101010101010101.

SWAR add z = x + y
    z = ((x &~H) + (y &~H)) ^ ((x ^ y) & H)

SWAR sub z = x - y
    z = ((x | H) - (y &~H)) ^ ((x ^~y) & H)

SWAR average z = (x+y)/2 based on x + y = (x^y) + 2*(x&y)
    z = (x & y) + (((x ^ y) & ~L) >> 1)
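
The three formulas above can be sketched in C for the bytewise case. This is a minimal sketch, assuming uint64_t from <stdint.h> as the register type; the function names swarAdd, swarSub, swarAvg and swarDemo are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static const uint64_t H = UINT64_C(0x8080808080808080); /* MSB of each byte */
static const uint64_t L = UINT64_C(0x0101010101010101); /* LSB of each byte */

/* Bytewise add: add the low 7 bits of each byte, then xor in the MSBs. */
uint64_t swarAdd(uint64_t x, uint64_t y) {
    return ((x & ~H) + (y & ~H)) ^ ((x ^ y) & H);
}

/* Bytewise subtract: setting the MSBs of x keeps borrows inside each byte. */
uint64_t swarSub(uint64_t x, uint64_t y) {
    return ((x | H) - (y & ~H)) ^ ((x ^ ~y) & H);
}

/* Bytewise average, rounded down: clearing the LSBs before the shift
   keeps bits from sliding into the neighbouring byte. */
uint64_t swarAvg(uint64_t x, uint64_t y) {
    return (x & y) + (((x ^ y) & ~L) >> 1);
}

/* Usage sketch on the two lowest bytes; all other bytes are zero. */
void swarDemo(void) {
    /* 0xFF + 0x01 wraps to 0x00 without carrying into the next byte */
    assert(swarAdd(UINT64_C(0x01FF), UINT64_C(0x0101)) == UINT64_C(0x0200));
    /* 0x00 - 0x01 wraps to 0xFF without borrowing from the next byte */
    assert(swarSub(UINT64_C(0x0100), UINT64_C(0x0001)) == UINT64_C(0x01FF));
    /* (0x10 + 0x20) / 2 = 0x18 and (0xFF + 0x01) / 2 = 0x80 */
    assert(swarAvg(UINT64_C(0x10FF), UINT64_C(0x2001)) == UINT64_C(0x1880));
}
```

Each byte behaves like an independent 8-bit value that wraps modulo 256, which is why the demo results differ from what a plain 64-bit add or subtract would give.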

Samples

These two SWAR-style, parallel prefix routines are strikingly similar. Horizontal mirroring and population count both act on vectors of duos, nibbles, and bytes: the first swaps bits, duos, and nibbles, while the second adds up their populations.

U64 mirrorHorizontal (U64 x) {
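    const U64 k1 = C64(0x5555555555555555); // every other bit
    const U64 k2 = C64(0x3333333333333333); // every other duo
    const U64 k4 = C64(0x0f0f0f0f0f0f0f0f); // low nibble of each byte
    x = ((x & k1) << 1) | ((x >> 1)  & k1); // swap adjacent bits
    x = ((x & k2) << 2) | ((x >> 2)  & k2); // swap adjacent duos
    x = ((x & k4) << 4) | ((x >> 4)  & k4); // swap the nibbles of each byte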
    return x;
}

int popCount (U64 x) {
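    const U64 k1 = C64(0x5555555555555555); // every other bit
    const U64 k2 = C64(0x3333333333333333); // every other duo
    const U64 k4 = C64(0x0f0f0f0f0f0f0f0f); // low nibble of each byte
    x =   x             - ((x >> 1)  & k1); // bit count of each duo
    x =  (x & k2)       + ((x >> 2)  & k2); // bit count of each nibble
    x = ( x             +  (x >> 4)) & k4 ; // bit count of each byte
    x = (x * C64(0x0101010101010101))>> 56; // sum of the eight byte counts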
    return (int) x;
}
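
The routines above rely on the U64 type and the C64 macro, whose definitions are not shown here. The sketch below assumes one plausible pair of definitions and adds two deliberately naive reference functions that one might use to cross-check the SWAR versions. The names mirrorHorizontalNaive, popCountNaive and referenceDemo are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t U64;        /* assumed definition of U64 */
#define C64(x) UINT64_C(x)   /* assumed definition of C64 */

/* Naive reference: reverse the bit order inside each byte (rank). */
U64 mirrorHorizontalNaive(U64 x) {
    U64 r = 0;
    for (int sq = 0; sq < 64; ++sq) {
        if ((x >> sq) & 1)
            r |= C64(1) << (sq ^ 7);  /* flip the file bits, keep the rank */
    }
    return r;
}

/* Naive reference: clear the lowest set bit until none is left. */
int popCountNaive(U64 x) {
    int n = 0;
    for (; x != 0; x &= x - 1)
        ++n;
    return n;
}

/* Usage sketch with hand-checked values. */
void referenceDemo(void) {
    /* byte 0: 0x02 becomes 0x40, byte 1: 0x01 becomes 0x80 */
    assert(mirrorHorizontalNaive(C64(0x0102)) == C64(0x8040));
    assert(popCountNaive(C64(0x00FF00FF)) == 16);
}
```

Comparing mirrorHorizontal(x) with mirrorHorizontalNaive(x), and popCount(x) with popCountNaive(x), over a range of inputs is a simple sanity check for the SWAR code.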

Publications

1987 ...

Alan H. Bond (1987). Broadcasting Arrays - A Highly Parallel Computer Architecture Suitable For Easy Fabrication. pdf
Guy E. Blelloch (1990). Vector Models for Data-Parallel Computing. MIT Press, pdf
Randell J. Fisher, Hank Dietz (1998). Compiling for SIMD Within a Register. LCPC 1998, pdf
Tom Thompson (1999). AltiVec Revealed. MacTech, Vol. 15, No. 7

2000 ...

Randell J. Fisher (2003). General-Purpose SIMD Within A Register: Parallel Processing on Consumer Microprocessors. Ph.D. thesis, Purdue University, advisor Hank Dietz, pdf
Daisuke Takahashi (2007). An Implementation of Parallel 1-D FFT Using SSE3 Instructions on Dual-Core Processors. Proc. Workshop on State-of-the-Art in Scientific and Parallel Computing, Lecture Notes in Computer Science, No. 4699, Springer
Daisuke Takahashi (2008). Implementation and Evaluation of Parallel FFT Using SIMD Instructions on Multi-Core Processors. Proc. 2007 International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems
Nicolas Fritz (2009). SIMD Code Generation in Data-Parallel Programming. Ph.D. thesis, Saarland University, pdf

2010 ...

Georg Hager ^[8], Jan Treibig, Gerhard Wellein (2013). The Practitioner's Cookbook for Good Parallel Performance on Multi- and Many-Core Systems. RRZE, SC13, slides as pdf
Kaixi Hou, Hao Wang, Wu-chun Feng (2015). ASPaS: A Framework for Automatic SIMDIZation of Parallel Sorting on x86-based Many-core Processors. ICS2015

Manuals

NXP Semiconductors

AltiVec Technology - Programming Interface Manual (pdf) ^[9]

Intel

Intel 64 and IA-32 Architectures Optimization Reference Manual (pdf)

Forum Posts

1999

G4 & AltiVec by Will Singleton, CCC, October 04, 1999 » AltiVec, PowerPC G4

2000 ...

Superlinear interpolator: a nice novelity ? by Marco Costalba, CCC, September 20, 2008 » Tapered Eval
Re: talk about IPP's evaluation by Richard Vida, CCC, November 07, 2009 » Ippolit, Tapered Eval

2010 ...

My experience with Linux/GCC by Richard Vida, CCC, March 23, 2011 » C, Linux, Tapered Eval
Re: Utilizing Architecture Specific Functions from a HL Language by Wylie Garvin, CCC, July 31, 2011
two values in one integer by Pierre Bokma, CCC, January 18, 2012
Pigeon now using opportunistic SIMD by Stuart Riffle, CCC, April 11, 2016 » Pigeon
couple of questions about stockfish code ? by Mahmoud Uthman, CCC, October 26, 2016 » Stockfish, Tapered Eval

2020 ...

SIMD methods in TT probing and replacement by Harm Geert Muller, CCC, February 20, 2020 » Transposition Table
CPU Vector Unit, the new jam for NNs... by Srdja Matovic, CCC, November 18, 2020 » NNUE

References

[1] Flynn's taxonomy from Wikipedia
[2] The Aggregate: SWAR, SIMD Within A Register by Hank Dietz
[3] SSE5 from Wikipedia
[4] SVE from Wikipedia
[5] SVE2 from Wikipedia
[6] VIS from Wikipedia
[7] RISC-V vector-set from Wikipedia
[8] Georg Hager's Blog | Random thoughts on High Performance Computing
[9] On December 7, 2015, NXP Semiconductors completed its acquisition of Freescale; see Freescale from Wikipedia
