Changes

Jump to: navigation, search

SIMD and SWAR Techniques

10,428 bytes added, 21:08, 10 May 2018
Created page with "'''Home * Programming * SIMD and SWAR Techniques''' [[FILE:SIMD.svg|border|right|thumb|[https://en.wikipedia.org/wiki/SIMD SIMD] <ref>[https://en.wikipedia...."
'''[[Main Page|Home]] * [[Programming]] * SIMD and SWAR Techniques'''

[[FILE:SIMD.svg|border|right|thumb|[https://en.wikipedia.org/wiki/SIMD SIMD] <ref>[https://en.wikipedia.org/wiki/Flynn%27s_taxonomy Flynn's taxonomy from Wikipedia]</ref> ]]

[[x86]], [[x86-64]], as well as [[PowerPC#G4|PowerPC]] and [https://en.wikipedia.org/wiki/Power_Architecture#Power_ISA_v.2.03 Power ISA v.2.03] processors provide '''Single Instructions''' on '''Multiple Data''' (SIMD), namely on [[Array|vectors]] of [[Float|floats]], [[Double|doubles]] or various integers, [[Byte|bytes]], [[Word|words]], [[Double Word|double words]] or [[Quad Word|quad words]], available through assembly and compiler intrinsics. SIMD-applications related to computer chess cover [[Bitboards|bitboard]] computations and [[Fill Algorithms|fill-algorithms]] like [[Dumb7Fill]] and [[Kogge-Stone Algorithm]], as well as [[Evaluation|evaluation]] related stuff, like this [[SSE2#SSE2dotproduct|SSE2 dot-product]] of 64 bits by a vector of 64 bytes.

'''SWAR''' as acronym for SIMD Within A Register was coined by [[Hank Dietz]] and Randy Fisher <ref>[http://www.aggregate.org/SWAR/ The Aggregate: SWAR, SIMD Within A Register] by [[Hank Dietz]]</ref> . It is a processing model which applies SIMD parallel processing across sections of a CPU register, often vectors of smaller than byte-entities are processed in [[Parallel Prefix Algorithms|parallel prefix]] manner.

=SIMD Instruction Sets=
* [[MMX]] on [[x86]] and [[x86-64]]
* [[SSE2]], [[SSE3]], [[SSSE3]] and [[SSE4]] on [[x86]] and [[x86-64]]
* [[SSE5]] by [[AMD]] (proposed but not implemented, replaced by [[XOP]] <ref>[https://en.wikipedia.org/wiki/SSE5 SSE5 from Wikipedia]</ref>)
* [[AltiVec]] on [[PowerPC#G4|PowerPC G4]], [[PowerPC#G5|PowerPC G5]]
* [[ARM NEON]]
* [[AVX]] by [[Intel]]
* [[AVX2]] by [[Intel]]
* [[AVX-512]] by [[Intel]]
* [[XOP]] by [[AMD]]
<span id="SWAR"></span>
=SWAR Arithmetic=
To apply addition and subtraction on vectors of bit-aggregates or [https://en.wikipedia.org/wiki/Bit_field bit-field structures] within a general purpose register, one has to take care carries and borrows don't wrap around. Thus the need to mask of all most significant bits (H) and add in two steps, one 'add' with MSB clear and one add modulo 2 aka '[[General Setwise Operations#ExclusiveOr|xor]]' for the MSB itself. For bytewise (rankwise) math inside a 64-bit register, H is <span style="background-color: #e3e3e3;">0x8080808080808080</span> and L is <span style="background-color: #e3e3e3;">0x0101010101010101</span>.
<pre>
SWAR add z = x + y
z = ((x &~H) + (y &~H)) ^ ((x ^ y) & H)
</pre>
<pre>
SWAR sub z = x - y
z = ((x | H) - (y &~H)) ^ ((x ^~y) & H)
</pre>
<pre>
SWAR average z = (x+y)/2 based on x + y = (x^y) + 2*(x&y)
z = (x & y) + (((x ^ y) & ~L) >> 1)
</pre>

=Samples=
Amazing, how similar these two SWAR- and [[Parallel Prefix Algorithms|parallel prefix wise]] routines are. [[Flipping Mirroring and Rotating#MirrorHorizontally|Mirror horizontally]] and [[Population Count#SWARPopcount|population count]] have in common to act on vectors of duos, [[Nibble|nibbles]] and [[Byte|bytes]]. One swaps bits, duos and nibbles, while the second adds populations of them.
<pre>
U64 mirrorHorizontal (U64 x) {
const U64 k1 = C64(0x5555555555555555);
const U64 k2 = C64(0x3333333333333333);
const U64 k4 = C64(0x0f0f0f0f0f0f0f0f);
x = ((x & k1) << 1) | ((x >> 1) & k1);
x = ((x & k2) << 2) | ((x >> 2) & k2);
x = ((x & k4) << 4) | ((x >> 4) & k4);
return x;
}
</pre>
<pre>
int popCount (U64 x) {
const U64 k1 = C64(0x5555555555555555);
const U64 k2 = C64(0x3333333333333333);
const U64 k4 = C64(0x0f0f0f0f0f0f0f0f);
x = x - ((x >> 1) & k1);
x = (x & k2) + ((x >> 2) & k2);
x = ( x + (x >> 4)) & k4 ;
x = (x * C64(0x0101010101010101))>> 56;
return (int) x;
}
</pre>
=Publications=
* [https://www.linkedin.com/in/tom-thompson-500bb7b Tom Thompson] ('''1999'''). ''[http://www.mactech.com/articles/mactech/Vol.15/15.07/AltiVecRevealed/index.html AltiVec Revealed]''. [http://www.mactech.com/ MacTech], Vol. 15, No. 7
* [https://www.researchgate.net/profile/Nicolas_Fritz2 Nicolas Fritz] ('''2009'''). ''SIMD Code Generation in Data-Parallel Programming''. Ph.D. thesis, [https://en.wikipedia.org/wiki/Saarland_University Saarland University], [http://scidok.sulb.uni-saarland.de/volltexte/2009/2563/pdf/Dissertation_9229_Frit_Nico_2009.pdf?q=ibms-cell-processor pdf]
* [https://www.rrze.fau.de/wir-ueber-uns/organigramm/mitarbeiter/index.shtml/georg-hager.shtml Georg Hager] <ref>[https://blogs.fau.de/hager/ Georg Hager's Blog | Random thoughts on High Performance Computing]</ref>, [http://dblp.uni-trier.de/pers/hd/t/Treibig:Jan Jan Treibig], [http://dblp.uni-trier.de/pers/hd/w/Wellein:Gerhard Gerhard Wellein] ('''2013'''). ''The Practitioner's Cookbook for Good Parallel Performance on Multi- and Many-Core Systems''. [https://de.wikipedia.org/wiki/Regionales_Rechenzentrum_Erlangen RRZE], [http://sc13.supercomputing.org/ SC13], [https://blogs.fau.de/hager/files/2013/11/sc13_tutorial_134.pdf slides as pdf]
* [https://scholar.google.com/citations?user=4Ab_NBkAAAAJ&hl=en Kaixi Hou], [[Hao Wang]], [http://dblp.uni-trier.de/pers/hd/f/Feng:Wu=chun Wu-chun Feng] ('''2015'''). ''ASPaS: A Framework for Automatic SIMDIZation of Parallel Sorting on x86-based Many-core Processors''. [http://dblp.uni-trier.de/db/conf/ics/ics2015.html#HouWF15 ICS2015],

=Manuals=
==AMD==
* [http://developer.amd.com/wordpress/media/2012/10/26568_APM_v41.pdf AMD64 Architecture Volume 4: 128-Bit and 256-Bit Media Instructions] (pdf)
* [http://support.amd.com/TechDocs/26569_APM_v5.pdf AMD64 Architecture Volume 5: 64-Bit Media and x87 Floating-Point Instructions] (pdf)
* [http://support.amd.com/TechDocs/43479.pdf AMD64 Architecture Volume 6: 128-Bit and 256-Bit XOP, FMA4 and CVT16 Instructions] (pdf)
==NXP Semiconductors==
* [http://www.nxp.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf AltiVec Technology - Programming Interface Manual] (pdf) <ref>On December 7, 2015, [https://en.wikipedia.org/wiki/NXP_Semiconductors NXP Semiconductors] completed its acquisition of Freescale, [https://en.wikipedia.org/wiki/Freescale_Semiconductor Freescale from Wikipedia]</ref>
==Intel==
* [http://www.intel.com/design/processor/manuals/248966.pdf Intel 64 and IA32 Architectures Optimization Reference Manual] (pdf)

=Forum Posts=
* [https://www.stmintz.com/ccc/index.php?id=71754 G4 & AltiVec] by [[Will Singleton]], [[CCC]], October 04, 1999
* [http://www.talkchess.com/forum/viewtopic.php?t=23860 Superlinear interpolator: a nice novelity ?] by [[Marco Costalba]], [[CCC]], September 20, 2008 » [[Tapered Eval]]
* [http://www.talkchess.com/forum/viewtopic.php?p=301746#301746 Re: talk about IPP's evaluation] by [[Richard Vida]], [[CCC]], November 07, 2009 » [[Ippolit]], [[Tapered Eval]]
* [http://www.talkchess.com/forum/viewtopic.php?t=38523 My experience with Linux/GCC] by [[Richard Vida]], [[CCC]], March 23, 2011 » [[C]], [[Linux]], [[Tapered Eval]]
* [http://www.talkchess.com/forum/viewtopic.php?t=39916&start=1 Re: Utilizing Architecture Specific Functions from a HL Language] by [[Wylie Garvin]], [[CCC]], July 31, 2011
* [http://www.talkchess.com/forum/viewtopic.php?t=42054 two values in one integer] by [[Pierre Bokma]], [[CCC]], January 18, 2012
* [http://www.talkchess.com/forum/viewtopic.php?t=61850 couple of questions about stockfish code ?] by [[Mahmoud Uthman]], [[CCC]], October 26, 2016 » [[Stockfish]], [[Tapered Eval]]

=External Links=
* [https://en.wikipedia.org/wiki/SIMD SIMD from Wikipedia]
* [https://en.wikipedia.org/wiki/SWAR SWAR from Wikipedia]
* [http://www.aggregate.org/SWAR/ The Aggregate: SWAR, SIMD Within A Register] by [[Hank Dietz]]
* [http://andythomason.com/lecture_notes/agp/agp_simd_programming.html Advanced game programing | Session 4 - Math libraries and SIMD] from [http://andythomason.com/lecture_notes/ Game programming lecture notes] by [[Andy Thomason]]
==x86==
* [https://en.wikipedia.org/wiki/MMX_%28instruction_set%29 MMX from Wikipedia]
* [https://en.wikipedia.org/wiki/3DNow 3DNow! from Wikipedia]
* [https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions Streaming SIMD Extensions from Wikipedia]
* [https://en.wikipedia.org/wiki/SSE2 SSE2 from Wikipedia]
* [https://en.wikipedia.org/wiki/SSE3 SSE3 from Wikipedia]
* [https://en.wikipedia.org/wiki/SSSE3 SSSE3 from Wikipedia]
* [https://en.wikipedia.org/wiki/SSE4 SSE4 from Wikipedia]
* [http://de.wikipedia.org/wiki/SSE4a SSE4a from Wikipedia]
* [https://en.wikipedia.org/wiki/SSE5 SSE5 from Wikipedia]
* [https://en.wikipedia.org/wiki/XOP_instruction_set XOP instruction set from Wikipedia]
* [https://en.wikipedia.org/wiki/Advanced_Vector_Extensions Advanced Vector Extensions from Wikipedia]
* [https://en.wikipedia.org/wiki/AVX-512 AVX-512 from Wikipedia]
* [http://developer.amd.com/cpu/Libraries/sseplus/Pages/default.aspx SSEPlus Project] from [http://developer.amd.com/pages/default.aspx AMD Developer Central]
* [http://sseplus.sourceforge.net/index.html SSEPlus Project Documentation]
==Other SIMD==
* [http://www.arm.com/products/multimedia/neon/index.html ARM NEON Technology]
* [https://en.wikipedia.org/wiki/ARM_architecture#Advanced_SIMD_.28NEON.29 ARM NEON Technology from Wikipedia]
* [https://en.wikipedia.org/wiki/AltiVec AltiVec from Wikipedia]
* [http://developer.apple.com/hardwaredrivers/ve/sse.html Hardware - SSE Performance Programming] from [http://developer.apple.com/index.html Apple Developer]
* [http://developer.apple.com/hardwaredrivers/ve/instruction_crossref.html Apple Instruction Cross-Reference] from [http://developer.apple.com/index.html Apple Developer]
==Misc==
* [https://en.wikipedia.org/wiki/Explicitly_parallel_instruction_computing Explicitly parallel instruction computing (EPIC) from Wikipedia]
* [https://en.wikipedia.org/wiki/Instruction-level_parallelism Instruction-level parallelism from Wikipedia]
* [https://en.wikipedia.org/wiki/MIMD MIMD from Wikipedia]
* [https://en.wikipedia.org/wiki/Parallel_Thread_Execution Parallel Thread Execution from Wikipedia] » [[GPU]], [[Thread]]
* [https://en.wikipedia.org/wiki/SPMD SPMD from Wikipedia]
* [https://en.wikipedia.org/wiki/Very_long_instruction_word Very long instruction word (VLIW) from Wikipedia]

=References=
<references />

'''[[Programming|Up one Level]]'''

Navigation menu