Changes

SSSE3

13,224 bytes added, 13:01, 9 August 2018

Created page with "'''Home * Hardware * x86 * SSSE3''' '''SSSE3''' (Supplemental Streaming SIMD Extension 3) is Intel's name for the SSE instruction set's fo..."

'''[[Main Page|Home]] * [[Hardware]] * [[x86]] * SSSE3'''

'''SSSE3''' (Supplemental Streaming SIMD Extension 3) is [[Intel|Intel's]] name for the [[SSE]] instruction set's fourth iteration. 16 new instructions, also available as [[MMX]]-extension with _m64 intrinsic datatype. SSSE3 was introduced in [https://en.wikipedia.org/wiki/Intel_Core_microarchitecture Intel's Core Microarchitecture]. SSSE3-intrinsic functions are available in [https://en.wikipedia.org/wiki/Visual_C%2B%2B Visual C] <ref>[http://msdn.microsoft.com/en-us/library/vstudio/bb892952%28v=vs.100%29.aspx Supplemental Streaming SIMD Extensions 3 Instructions]</ref> or [https://en.wikipedia.org/wiki/Intel_C%2B%2B_Compiler Intel-C] <ref>[https://www.cs.fsu.edu/~engelen/courses/HPC-adv/intref_cls.pdf Intel C++ Compiler for Linux Intrinsics Reference] (pdf)</ref> .

SSSE3 instructions are not available for [[AMD]] processors until Bulldozer, which also implements [[SSE4]] and [[AVX]].

=Instructions=
Some of the new instructions are quite interesting for computer chess, with applications in [[Evaluation|evaluation]] and byte shuffling of [[Bitboards|bitboards]].

==PH(ADD/SUB)W==
Packed Horizontal Add/Subtract Words. Each of the eight shorts integers is the sum/difference between adjacent pairs of elements in the input parameters. Saturating versions, PHADDSW and PHSUBSW, are also available.

The primary downside of these instructions is that they tend to be very slow multiple-uop instructions on most CPUs, resulting in alternate instruction sequences almost always being faster.

===Intrinsic===
_m128i [http://msdn.microsoft.com/en-us/library/bb514049%28v=vs.100%29.aspx _mm_hadd_epi16] (_m128i a, _m128i b);

===Pseudocode===
<pre>
signed short a[ 8]; // input a
signed short b[ 8]; // input b
signed short r[ 8]; // output r

for (i=0; i < 4; i++) {
r[ i] = a[2i] + a[2i+1];
r[4+i] = b[2i] + b[2i+1];
}
</pre>

==PMADDUBSW==
Packed Multiply and Add a vector of 16 unsigned bytes (char) with a vector of 16 signed bytes (not [https://en.wikipedia.org/wiki/Commutative commutative]! <ref>[http://gcc.gnu.org/ml/gcc-patches/2008-03/msg00619.html fixing SSSE3 pmaddubsw description] from the mail archive of the [[mailto:gcc-patches@gcc.gnu.org|gcc-patches@gcc.gnu.org]] mailing list for the [[Free Software Foundation#GCC|GCC project]]</ref> ). Two consecutive products are added and the saturated signed 16-bit results are stored as vector of eight signed shorts.

===Intrinsic===
_m128i [http://msdn.microsoft.com/en-us/library/bb514017%28v=vs.100%29.aspx _mm_maddubs_epi16] (_m128i a, _m128i b);

===Pseudocode===
<pre>
unsigned char a[16]; // input a
signed char b[16]; // input b
signed short r[ 8]; // output r

for (i=0; i < 8; i++)
r[i] = SATURATE_16(a[2i]*b[2i] + a[2i+1] * b[2i+1]);
</pre>

==PMULHRSW==
Packed Multiply High with Round and Scale is an instruction designed for fixed-point math. It is similar to the existing pmulhw, but only shifts right by 15 instead of 16, and adds a factor for correct rounding.

===Intrinsic===
_m128i [http://msdn.microsoft.com/en-us/library/bb513995%28v=vs.100%29.aspx _mm_mulhrs_epi16] (_m128i a, _m128i b);

===Pseudocode===
<pre>
signed short a[8]; // input a
signed short b[8]; // input b
signed short r[8]; // output r

for (i=0; i < 8; i++)
r[i] = INT16((a[i]*b[i] + 0x4000) >> 15);
</pre>

==PSHUFB==
Packed Shuffle Bytes is a very powerful instruction that can perform a fast arbitrary byte-shuffle of a register. It can also set some output bytes to zero instead of selecting them from the input. Packed Shuffle Bytes is used inside the [[SSSE3#SSSE3Version|SSSE3 Version of Hyperbola Quintessence]] to perform byte swaps. There might be some other interesting applications too, such as [[SSSE3#PopCount|SSSE3 Population Count]]. [[XOP#VPPERM|VPPERM]] from [[XOP]] is an even more powerful variant on this instruction.

===Intrinsic===
_m128i [http://msdn.microsoft.com/en-us/library/bb531427%28v=vs.100%29.aspx _mm_shuffle_epi8] (_m128i a, _m128i b);

===Pseudocode===
<pre>
char a[16]; // input a
char b[16]; // input b
char r[16]; // output r

for (i=0; i < 16; i++)
r[i] = (b[i] < 0) ? 0 : a[b[i] % 16];
</pre>

==PSIGN==
Multiplies each element of one vector with the [https://en.wikipedia.org/wiki/Sign_function sign function] {-1,0,1} of the second vector. The instruction is available for signed [[Byte|bytes]] (8-bit char psignb), signed [[Word|words]] (16-bit short psignw) and [[Double Word|double words]] (32-bit int psignd). If both input operands are equal, the result is a vector of absolute values, though PABS is probably preferred for this purpose.

===Intrinsic===
_m128i [http://msdn.microsoft.com/en-us/library/bb531461%28v=vs.100%29.aspx _mm_sign_epi8] (_m128i a, _m128i b);
_m128i [http://msdn.microsoft.com/en-us/library/bb531410%28v=vs.100%29.aspx _mm_sign_epi16] (_m128i a, _m128i b);
_m128i [http://msdn.microsoft.com/en-us/library/bb514081%28v=vs.100%29.aspx _mm_sign_epi32] (_m128i a, _m128i b);

===Pseudocode===
<pre>
/* type := {char, short, int}, N = {16, 8, 4} */
#define N (sizeof(__m128i)/sizeof(type))
type a[N]; // input a
type b[N]; // input b
type r[N]; // output r

for (i=0; i < N; i++)
r[i] = (b[i] < 0) ? -a[i] : ((b[i] == 0) ? 0 : a[i])
</pre>

=Applications=
==SSE3 Population Count==
In 2008, [[Wojciech Muła]] introduced a SSSE3 [[Population Count]] performing a pair of [[SSSE3#PSHUFB|PSHUFB]] 16 parallel [[Nibble|nibble]] in-xmm lookups <ref>[http://0x80.pl/articles/sse-popcount.html SSSE3: fast popcount] by [[Wojciech Muła]], May 24, 2008</ref>, in the meantime due to [[AVX2]] or [[AVX-512]] even possible with doubled or fourfold register widths, competing the native [[x86-64]] [[x86-64#gpinstructions|popcount instruction]] <ref>[[Wojciech Muła]], [http://dblp.uni-trier.de/pers/hd/k/Kurz:Nathan Nathan Kurz], [https://github.com/lemire Daniel Lemire] ('''2016'''). ''Faster Population Counts Using AVX2 Instructions''. [https://arxiv.org/abs/1611.07612 arXiv:1611.07612]</ref>
<pre>
/**
* popCount2
* @author Wojciech Muła
* @param v vector of two bitboards
* @return quad word vector of two popcounts
*/
__m128i popCount(__m128i v) {
__m128i lu = _mm_setr_epi8(0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4);
__m128i lm = _mm_set1_epi8(0x0f);
__m128i lo = _mm_and_si128(v, lm);
__m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), lm);
__m128i cl = _mm_shuffle_epi8(lu, lo); /* 16 low nibble counts */
__m128i ch = _mm_shuffle_epi8(lu, hi); /* 16 high nibble counts */
__m128i cb = _mm_add_epi8(cl, ch); /* 16 byte counts */
return _mm_sad_epu8(cb, _mm_setzero_si128());
}
</pre>

==Byte-wise Dot-Product==
This SSSE3-dot-product multiplies a vector of 64 unsigned chars with a vector of 64 signed char, and adds all 64 intermediate signed 16-bit products with saturation. It uses the [[SSSE3#PMADDUBSW|pmaddubsw]] and [[SSSE3#PHADDSW|phaddsw]] SSSE3 instructions, in total 11 SSE instructions.
<pre>
int dotProduct(unsigned char features[], char weights[] /* XMM_ALIGN */) {
__m128i r0, r1, r2, r3;
__m128i* a = (__m128i*) features;
__m128i* b = (__m128i*) weights;
r0 = _mm_maddubs_epi16 (a[0], b[0]);
r1 = _mm_maddubs_epi16 (a[1], b[1]);
r2 = _mm_maddubs_epi16 (a[2], b[2]);
r3 = _mm_maddubs_epi16 (a[3], b[3]);
r0 = _mm_adds_epi16 (r0, r1);
r2 = _mm_adds_epi16 (r2, r3);
r0 = _mm_adds_epi16 (r0, r2); // 8 shorts
r0 = _mm_hadds_epi16 (r0, r0); // 4 shorts
r0 = _mm_hadds_epi16 (r0, r0); // 2 shorts
r0 = _mm_hadds_epi16 (r0, r0); // 1 final short
short s = (short)_mm_extract_epi16(r0, 0);
return (int) s; // sign extended
}
</pre>

==SSSE3 Hyperbola Quintessence==
Following routine calculates bishop attacks performing the [[Hyperbola Quintessence]]. Both [[Anti-Diagonals|anti-diagonal]] and [[Diagonals|diagonal]] attacks are processed in parallel, using both halves of a 128-bit xmm register and pre-calculated [[Array|arrays]] of bitboard pairs for line-masks, single- and eventually reversed bits. [[SSSE3#PSHUFB|PSHUFB]] is used to swap bytes inside a [[Bitboards|bitboard]] <ref>[http://www.talkchess.com/forum/viewtopic.php?topic_view=threads&p=367616&t=35858 Re: Plain and fancy magic on modern hardware] by [[Robert Purves]], [[Computer Chess Forums|CCC]], August 23, 2010</ref> .
===Intrinsic Version===
<pre>
__m128i diaAntiMaskXMM[64]; // 1 KByte antidiag : diagonal, excluding square
__m128i singleBitsXMM [64]; // 1 KByte 1<<sq : 1<<sq
__m128i swapMaskXMM; // needs to be initialized to swap the bytes in both quad-words

// SSSE3 Hyperbola Quintessence
U64 bishopAttacks(U64 occ, enumSquare sq) {
__m128i o, r, m, b, s;
m = diaAntiMaskXMM[sq]; // antidiag : diagonal, excluding square
b = singleBitsXMM [sq]; // bishop bits, equal qwords
s = swapMaskXMM;
o = _mm_cvtsi64x_si128 (occ) ; // general purpose 64 bit to xmm low qword
o = _mm_unpacklo_epi64 (o, o); // occ : occ
o = _mm_and_si128 (o, m); // o (antidiag : diagonal)
r = _mm_shuffle_epi8 (o, s); // o'(antidiag : diagonal)
o = _mm_sub_epi64 (o, b); // o - bishop
b = _mm_shuffle_epi8 (b, s); // bishop', one may also use singleBitsXMM [sq^56]
r = _mm_sub_epi64 (r, b); // o'- bishop'
r = _mm_shuffle_epi8 (r, s); // re-reverse
o = _mm_xor_si128 (o, r); // attacks
o = _mm_and_si128 (o, m); // antidiag : diagonal
r = _mm_unpackhi_epi64 (o, o); // antidiag : antidiag
o = _mm_add_epi64 (o, r); // diagonal + antidiag
return _mm_cvtsi128_si64(o); // convert xmm to general purpose 64 bit
}
</pre>

===Peshkov's Optimization===
The pioneer of Hyperbola Quintessence, [[Aleks Peshkov]], applies a more sophisticated, optimized [[Cpp|C++]] approach, further utilizing disjoint ray-attacks, and [[General Setwise Operations#ExclusiveOr|xor]] instruction is own inverse and distributive over [[Flipping Mirroring and Rotating|mirroring or flipping]]. Once per [[Node|node]] he instantiates an [[Occupancy|occupied object]] based on a [https://en.wikipedia.org/wiki/Type_safety type-safe] [https://en.wikipedia.org/wiki/Curiously_recurring_template_pattern curiously recurring template pattern] aka the [https://en.wikipedia.org/wiki/Barton%E2%80%93Nackman_trick Barton–Nackman trick] to encapsulate the SSE intrinsic data types, to keep a twin of normal and flipped occupancy as base for further [[Files|file]]-, [[Diagonals|diagonal]] or [[Anti-Diagonals|anti-diagonal]] attack generations, which then requires only one final shuffle per piece attack getter, whether it is a bishop or even a queen.

<pre>
class Occupied : public BitSet<Occupied, char_x16_t> {
typedef BitSet<Occupied, char_x16_t> Base;

struct Mask {
value_type singleton;
value_type file;
value_type diagonal;
value_type antidiag;

Mask operator () (Square);
};
typedef Square::const_array<Mask, Mask> CACHE_ALIGN AttackMask;
static const AttackMask mask;

struct Shuffle {
value_type hyperbola;
value_type flipShift;
Shuffle ();
};
static const Shuffle shuffle;

static BitBoard bitboard(value_type value) {
return BitBoard(static_cast<BitBoard::value_type>(_mm_cvtsi128_si64(value)));
}

static BitBoard hyperbola(value_type value) {
value_type reverse = _mm_shuffle_epi8(value, shuffle.hyperbola);
return bitboard(value ^ reverse); //xor is principal here
}

INLINE BitBoard bishopAttack(Square from) const {
value_type d = value;
value_type a = value;
d &= mask[from].diagonal;
a &= mask[from].antidiag;
d = _mm_sub_epi64(d, mask[from].singleton);
a = _mm_sub_epi64(a, mask[from].singleton);
d &= mask[from].diagonal;
a &= mask[from].antidiag;
return hyperbola(d ^ a);
}

...
public:
Occupied (const BitBoard& myB, const BitBoard& opB) {
value_type my = _mm_cvtsi64_si128(reinterpret_cast<const U64&>(myB));
value_type op = _mm_cvtsi64_si128(reinterpret_cast<const U64&>(opB));
op = _mm_shuffle_epi8(op, shuffle.flipShift);
assert (Occupied(my & op) == empty()); //no intersection
this->value = my ^ op;
}

...
};
</pre>

=See also=
* [[AltiVec]]
* [[AVX]]
* [[AVX2]]
* [[AVX-512]]
* [[MMX]]
* [[SIMD and SWAR Techniques]]
* [[SSE2]]
* [[SSE3]]
* [[SSE4]]
* [[SSE5]]
* [[x86-64]]
* [[XOP]]

=External Links=
* [https://en.wikipedia.org/wiki/SSSE3 SSSE3 from Wikipedia]
* [http://sseplus.sourceforge.net/index.html SSEPlus Project Documentation]
* [http://www.agner.org/optimize/blog/ Agner`s CPU blog] by [http://www.agner.org/ Agner Fog]
* [http://0x80.pl/articles/sse-popcount.html SSSE3: fast popcount] by [[Wojciech Muła]], May 24, 2008 » [[Population Count]]
* [http://software.intel.com/sites/landingpage/IntrinsicsGuide/ Intel Intrinsics Guide]

=References=
<references />

'''[[x86|Up one Level]]'''

GerdIsenberg

Bureaucrats, Administrators

25,161

edits

Changes

SSSE3

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools