Difference between revisions of "SIMD and SWAR Techniques"

From Chessprogramming wiki
Jump to: navigation, search
(8 intermediate revisions by the same user not shown)
Line 5: Line 5:
 
[[x86]], [[x86-64]], as well as [[PowerPC#G4|PowerPC]] and [https://en.wikipedia.org/wiki/Power_Architecture#Power_ISA_v.2.03 Power ISA v.2.03] processors provide '''Single Instructions''' on '''Multiple Data''' (SIMD), namely on [[Array|vectors]] of [[Float|floats]], [[Double|doubles]] or various integers, [[Byte|bytes]], [[Word|words]], [[Double Word|double words]] or [[Quad Word|quad words]], available through assembly and compiler intrinsics. SIMD-applications related to computer chess cover [[Bitboards|bitboard]] computations and [[Fill Algorithms|fill-algorithms]] like [[Dumb7Fill]] and [[Kogge-Stone Algorithm]], as well as [[Evaluation|evaluation]] related stuff, like this [[SSE2#SSE2dotproduct|SSE2 dot-product]] of 64 bits by a vector of 64 bytes.
 
[[x86]], [[x86-64]], as well as [[PowerPC#G4|PowerPC]] and [https://en.wikipedia.org/wiki/Power_Architecture#Power_ISA_v.2.03 Power ISA v.2.03] processors provide '''Single Instructions''' on '''Multiple Data''' (SIMD), namely on [[Array|vectors]] of [[Float|floats]], [[Double|doubles]] or various integers, [[Byte|bytes]], [[Word|words]], [[Double Word|double words]] or [[Quad Word|quad words]], available through assembly and compiler intrinsics. SIMD-applications related to computer chess cover [[Bitboards|bitboard]] computations and [[Fill Algorithms|fill-algorithms]] like [[Dumb7Fill]] and [[Kogge-Stone Algorithm]], as well as [[Evaluation|evaluation]] related stuff, like this [[SSE2#SSE2dotproduct|SSE2 dot-product]] of 64 bits by a vector of 64 bytes.
  
'''SWAR''' as acronym for SIMD Within A Register was coined by [[Hank Dietz]] and Randy Fisher <ref>[http://www.aggregate.org/SWAR/ The Aggregate: SWAR, SIMD Within A Register] by [[Hank Dietz]]</ref> . It is a processing model which applies SIMD parallel processing across sections of a CPU register, often vectors of smaller than byte-entities are processed in [[Parallel Prefix Algorithms|parallel prefix]] manner.  
+
'''SWAR''' as acronym for SIMD Within A Register was coined by [[Hank Dietz]] and '''Randell J. Fisher''' <ref>[http://www.aggregate.org/SWAR/ The Aggregate: SWAR, SIMD Within A Register] by [[Hank Dietz]]</ref> . It is a processing model which applies SIMD parallel processing across sections of a CPU register, often vectors of smaller than byte-entities are processed in [[Parallel Prefix Algorithms|parallel prefix]] manner.  
  
 
=SIMD Instruction Sets=  
 
=SIMD Instruction Sets=  
Line 13: Line 13:
 
* [[AltiVec]] on [[PowerPC#G4|PowerPC G4]], [[PowerPC#G5|PowerPC G5]]
 
* [[AltiVec]] on [[PowerPC#G4|PowerPC G4]], [[PowerPC#G5|PowerPC G5]]
 
* [[ARM NEON]]
 
* [[ARM NEON]]
 +
* [[ARM Helium]]
 
* [[AVX]] by [[Intel]]  
 
* [[AVX]] by [[Intel]]  
 
* [[AVX2]] by [[Intel]]
 
* [[AVX2]] by [[Intel]]
Line 58: Line 59:
 
}
 
}
 
</pre>
 
</pre>
 +
=See  also=
 +
* [[GPU]]
 +
* [[NNUE]]
 +
* [[Parallel Prefix Algorithms]]
 +
 
=Publications=
 
=Publications=
 +
==1987 ...==
 +
*  [[Alan H. Bond]] ('''1987'''). ''Broadcasting Arrays - A Highly Parallel Computer Architecture Suitable For Easy Fabrication''. [http://www.exso.com/bc.pdf pdf]
 +
* [[Mathematician#GEBlelloch|Guy E. Blelloch]] ('''1990'''). ''[https://dl.acm.org/citation.cfm?id=91254 Vector Models for Data-Parallel Computing]''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press], [https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf pdf]
 +
* [https://dblp.uni-trier.de/pers/f/Fisher:Randall_J=.html Randell J. Fisher], [[Hank Dietz]] ('''1998'''). ''[https://link.springer.com/chapter/10.1007/3-540-48319-5_19 Compiling for SIMD Within a Register]''.  [https://dblp.uni-trier.de/db/conf/lcpc/lcpc1998.html LCPC 1998], [https://link.springer.com/chapter/10.1007/3-540-48319-5_19 pdf]
 
* [https://www.linkedin.com/in/tom-thompson-500bb7b Tom Thompson] ('''1999'''). ''[http://www.mactech.com/articles/mactech/Vol.15/15.07/AltiVecRevealed/index.html AltiVec Revealed]''. [http://www.mactech.com/ MacTech], Vol. 15, No. 7
 
* [https://www.linkedin.com/in/tom-thompson-500bb7b Tom Thompson] ('''1999'''). ''[http://www.mactech.com/articles/mactech/Vol.15/15.07/AltiVecRevealed/index.html AltiVec Revealed]''. [http://www.mactech.com/ MacTech], Vol. 15, No. 7
* [[Daisuke Takahashi]] ('''2007'''). ''[http://www.springerlink.com/content/6481607675376w42/ An Implementation of Parallel 1-D FFT Using SSE3 Instructions on Dual-Core Processors]''. Proc. Workshop on State-of-the-Art in Scientific and Parallel Computing, [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], No. 4699, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]
+
==2000 ...==
 +
* [https://dblp.uni-trier.de/pers/f/Fisher:Randall_J=.html Randell J. Fisher] ('''2003'''). ''[https://docs.lib.purdue.edu/dissertations/AAI3108343/ General-Purpose SIMD Within A Register: Parallel Processing on Consumer Microprocessors]''. Ph.D. thesis, [https://en.wikipedia.org/wiki/Purdue_University Purdue University], advisor [[Hank Dietz]], [http://aggregate.org/SWAR/Dis/dissertation.pdf pdf]
 +
* [[Daisuke Takahashi]] ('''2007'''). ''[https://link.springer.com/chapter/10.1007/978-3-540-75755-9_135/ An Implementation of Parallel 1-D FFT Using SSE3 Instructions on Dual-Core Processors]''. Proc. Workshop on State-of-the-Art in Scientific and Parallel Computing, [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], No. 4699, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]
 
* [[Daisuke Takahashi]] ('''2008'''). ''Implementation and Evaluation of Parallel FFT Using SIMD Instructions on Multi-Core Processors''. Proc. 2007 International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems
 
* [[Daisuke Takahashi]] ('''2008'''). ''Implementation and Evaluation of Parallel FFT Using SIMD Instructions on Multi-Core Processors''. Proc. 2007 International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems
 
* [https://www.researchgate.net/profile/Nicolas_Fritz2 Nicolas Fritz] ('''2009'''). ''SIMD Code Generation in Data-Parallel Programming''. Ph.D. thesis, [https://en.wikipedia.org/wiki/Saarland_University Saarland University], [http://scidok.sulb.uni-saarland.de/volltexte/2009/2563/pdf/Dissertation_9229_Frit_Nico_2009.pdf?q=ibms-cell-processor pdf]
 
* [https://www.researchgate.net/profile/Nicolas_Fritz2 Nicolas Fritz] ('''2009'''). ''SIMD Code Generation in Data-Parallel Programming''. Ph.D. thesis, [https://en.wikipedia.org/wiki/Saarland_University Saarland University], [http://scidok.sulb.uni-saarland.de/volltexte/2009/2563/pdf/Dissertation_9229_Frit_Nico_2009.pdf?q=ibms-cell-processor pdf]
 +
==2010 ...==
 
* [https://www.rrze.fau.de/wir-ueber-uns/organigramm/mitarbeiter/index.shtml/georg-hager.shtml Georg Hager] <ref>[https://blogs.fau.de/hager/ Georg Hager's Blog | Random thoughts on High Performance Computing]</ref>, [http://dblp.uni-trier.de/pers/hd/t/Treibig:Jan Jan Treibig], [http://dblp.uni-trier.de/pers/hd/w/Wellein:Gerhard Gerhard Wellein] ('''2013'''). ''The Practitioner's Cookbook for Good Parallel Performance on Multi- and Many-Core Systems''. [https://de.wikipedia.org/wiki/Regionales_Rechenzentrum_Erlangen RRZE], [http://sc13.supercomputing.org/ SC13], [https://blogs.fau.de/hager/files/2013/11/sc13_tutorial_134.pdf slides as pdf]
 
* [https://www.rrze.fau.de/wir-ueber-uns/organigramm/mitarbeiter/index.shtml/georg-hager.shtml Georg Hager] <ref>[https://blogs.fau.de/hager/ Georg Hager's Blog | Random thoughts on High Performance Computing]</ref>, [http://dblp.uni-trier.de/pers/hd/t/Treibig:Jan Jan Treibig], [http://dblp.uni-trier.de/pers/hd/w/Wellein:Gerhard Gerhard Wellein] ('''2013'''). ''The Practitioner's Cookbook for Good Parallel Performance on Multi- and Many-Core Systems''. [https://de.wikipedia.org/wiki/Regionales_Rechenzentrum_Erlangen RRZE], [http://sc13.supercomputing.org/ SC13], [https://blogs.fau.de/hager/files/2013/11/sc13_tutorial_134.pdf slides as pdf]
 
* [https://scholar.google.com/citations?user=4Ab_NBkAAAAJ&hl=en Kaixi Hou], [[Hao Wang]], [http://dblp.uni-trier.de/pers/hd/f/Feng:Wu=chun Wu-chun Feng] ('''2015'''). ''ASPaS: A Framework for Automatic SIMDIZation of Parallel Sorting on x86-based Many-core Processors''. [http://dblp.uni-trier.de/db/conf/ics/ics2015.html#HouWF15 ICS2015],  
 
* [https://scholar.google.com/citations?user=4Ab_NBkAAAAJ&hl=en Kaixi Hou], [[Hao Wang]], [http://dblp.uni-trier.de/pers/hd/f/Feng:Wu=chun Wu-chun Feng] ('''2015'''). ''ASPaS: A Framework for Automatic SIMDIZation of Parallel Sorting on x86-based Many-core Processors''. [http://dblp.uni-trier.de/db/conf/ics/ics2015.html#HouWF15 ICS2015],  
Line 77: Line 90:
  
 
=Forum Posts=  
 
=Forum Posts=  
* [https://www.stmintz.com/ccc/index.php?id=71754 G4 & AltiVec] by [[Will Singleton]], [[CCC]], October 04, 1999
+
==1999==
 +
* [https://www.stmintz.com/ccc/index.php?id=71754 G4 & AltiVec] by [[Will Singleton]], [[CCC]], October 04, 1999 » [[AltiVec]], [[PowerPC #G4|PowerPC G4]]
 +
==2000 ...==
 
* [http://www.talkchess.com/forum/viewtopic.php?t=23860 Superlinear interpolator: a nice novelity ?] by [[Marco Costalba]], [[CCC]], September 20, 2008 » [[Tapered Eval]]
 
* [http://www.talkchess.com/forum/viewtopic.php?t=23860 Superlinear interpolator: a nice novelity ?] by [[Marco Costalba]], [[CCC]], September 20, 2008 » [[Tapered Eval]]
 
* [http://www.talkchess.com/forum/viewtopic.php?p=301746#301746 Re: talk about IPP's evaluation] by [[Richard Vida]], [[CCC]], November 07, 2009 » [[Ippolit]], [[Tapered Eval]]
 
* [http://www.talkchess.com/forum/viewtopic.php?p=301746#301746 Re: talk about IPP's evaluation] by [[Richard Vida]], [[CCC]], November 07, 2009 » [[Ippolit]], [[Tapered Eval]]
 +
==2010 ...==
 
* [http://www.talkchess.com/forum/viewtopic.php?t=38523 My experience with Linux/GCC] by [[Richard Vida]], [[CCC]], March 23, 2011 » [[C]], [[Linux]], [[Tapered Eval]]
 
* [http://www.talkchess.com/forum/viewtopic.php?t=38523 My experience with Linux/GCC] by [[Richard Vida]], [[CCC]], March 23, 2011 » [[C]], [[Linux]], [[Tapered Eval]]
 
* [http://www.talkchess.com/forum/viewtopic.php?t=39916&start=1 Re: Utilizing Architecture Specific Functions from a HL Language] by [[Wylie Garvin]], [[CCC]], July 31, 2011
 
* [http://www.talkchess.com/forum/viewtopic.php?t=39916&start=1 Re: Utilizing Architecture Specific Functions from a HL Language] by [[Wylie Garvin]], [[CCC]], July 31, 2011
 
* [http://www.talkchess.com/forum/viewtopic.php?t=42054 two values in one integer] by [[Pierre Bokma]], [[CCC]], January 18, 2012
 
* [http://www.talkchess.com/forum/viewtopic.php?t=42054 two values in one integer] by [[Pierre Bokma]], [[CCC]], January 18, 2012
 +
* [http://www.talkchess.com/forum/viewtopic.php?t=59820 Pigeon now using opportunistic SIMD] by [[Stuart Riffle]], [[CCC]], April 11, 2016 » [[Pigeon]]
 
* [http://www.talkchess.com/forum/viewtopic.php?t=61850 couple of questions about stockfish code ?] by [[Mahmoud Uthman]], [[CCC]], October 26, 2016 » [[Stockfish]], [[Tapered Eval]]
 
* [http://www.talkchess.com/forum/viewtopic.php?t=61850 couple of questions about stockfish code ?] by [[Mahmoud Uthman]], [[CCC]], October 26, 2016 » [[Stockfish]], [[Tapered Eval]]
 +
==2020 ...==
 +
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=73126 SIMD methods in TT probing and replacement] by [[Harm Geert Muller]], [[CCC]], February 20, 2020 » [[Transposition Table]]
 +
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75862 CPU Vector Unit, the new jam for NNs...] by [[Srdja Matovic]], [[CCC]], November 18, 2020 » [[NNUE]]
  
 
=External Links=  
 
=External Links=  
Line 89: Line 109:
 
* [https://en.wikipedia.org/wiki/SWAR SWAR from Wikipedia]
 
* [https://en.wikipedia.org/wiki/SWAR SWAR from Wikipedia]
 
* [http://www.aggregate.org/SWAR/ The Aggregate: SWAR, SIMD Within A Register] by [[Hank Dietz]]
 
* [http://www.aggregate.org/SWAR/ The Aggregate: SWAR, SIMD Within A Register] by [[Hank Dietz]]
* [http://andythomason.com/lecture_notes/agp/agp_simd_programming.html Advanced game programing | Session 4 - Math libraries and SIMD] from [http://andythomason.com/lecture_notes/ Game programming lecture notes] by [[Andy Thomason]]
+
==[[x86]]/[[x86-64]]==  
==x86==  
 
 
* [https://en.wikipedia.org/wiki/MMX_%28instruction_set%29 MMX from Wikipedia]
 
* [https://en.wikipedia.org/wiki/MMX_%28instruction_set%29 MMX from Wikipedia]
 
* [https://en.wikipedia.org/wiki/3DNow 3DNow! from Wikipedia]
 
* [https://en.wikipedia.org/wiki/3DNow 3DNow! from Wikipedia]
Line 98: Line 117:
 
* [https://en.wikipedia.org/wiki/SSSE3 SSSE3 from Wikipedia]
 
* [https://en.wikipedia.org/wiki/SSSE3 SSSE3 from Wikipedia]
 
* [https://en.wikipedia.org/wiki/SSE4 SSE4 from Wikipedia]
 
* [https://en.wikipedia.org/wiki/SSE4 SSE4 from Wikipedia]
* [http://de.wikipedia.org/wiki/SSE4a SSE4a from Wikipedia]
+
* [https://de.wikipedia.org/wiki/SSE4a SSE4a from Wikipedia.de]
 
* [https://en.wikipedia.org/wiki/SSE5 SSE5 from Wikipedia]
 
* [https://en.wikipedia.org/wiki/SSE5 SSE5 from Wikipedia]
 
* [https://en.wikipedia.org/wiki/XOP_instruction_set XOP instruction set from Wikipedia]
 
* [https://en.wikipedia.org/wiki/XOP_instruction_set XOP instruction set from Wikipedia]
Line 105: Line 124:
 
* [http://developer.amd.com/cpu/Libraries/sseplus/Pages/default.aspx SSEPlus Project] from [http://developer.amd.com/pages/default.aspx AMD Developer Central]
 
* [http://developer.amd.com/cpu/Libraries/sseplus/Pages/default.aspx SSEPlus Project] from [http://developer.amd.com/pages/default.aspx AMD Developer Central]
 
* [http://sseplus.sourceforge.net/index.html SSEPlus Project Documentation]
 
* [http://sseplus.sourceforge.net/index.html SSEPlus Project Documentation]
==Other SIMD==  
+
==Other==  
* [http://www.arm.com/products/multimedia/neon/index.html ARM NEON Technology]
+
* [https://developer.arm.com/architectures/instruction-sets/simd-isas/neon SIMD ISAs | Neon – Arm Developer]
 
* [https://en.wikipedia.org/wiki/ARM_architecture#Advanced_SIMD_.28NEON.29 ARM NEON Technology from Wikipedia]
 
* [https://en.wikipedia.org/wiki/ARM_architecture#Advanced_SIMD_.28NEON.29 ARM NEON Technology from Wikipedia]
 +
* [https://developer.arm.com/architectures/instruction-sets/simd-isas/helium SIMD ISAs | Arm Helium technology – Arm Developer]
 
* [https://en.wikipedia.org/wiki/AltiVec AltiVec from Wikipedia]
 
* [https://en.wikipedia.org/wiki/AltiVec AltiVec from Wikipedia]
* [http://developer.apple.com/hardwaredrivers/ve/sse.html Hardware - SSE Performance Programming] from [http://developer.apple.com/index.html Apple Developer]
 
* [http://developer.apple.com/hardwaredrivers/ve/instruction_crossref.html Apple Instruction Cross-Reference] from [http://developer.apple.com/index.html Apple Developer]
 
 
==Misc==
 
==Misc==
 
* [https://en.wikipedia.org/wiki/Explicitly_parallel_instruction_computing Explicitly parallel instruction computing (EPIC) from Wikipedia]
 
* [https://en.wikipedia.org/wiki/Explicitly_parallel_instruction_computing Explicitly parallel instruction computing (EPIC) from Wikipedia]
Line 121: Line 139:
 
=References=  
 
=References=  
 
<references />
 
<references />
 
 
'''[[Programming|Up one Level]]'''
 
'''[[Programming|Up one Level]]'''

Revision as of 18:59, 18 November 2020

Home * Programming * SIMD and SWAR Techniques

x86, x86-64, as well as PowerPC and Power ISA v.2.03 processors provide Single Instructions on Multiple Data (SIMD), namely on vectors of floats, doubles or various integers, bytes, words, double words or quad words, available through assembly and compiler intrinsics. SIMD-applications related to computer chess cover bitboard computations and fill-algorithms like Dumb7Fill and Kogge-Stone Algorithm, as well as evaluation related stuff, like this SSE2 dot-product of 64 bits by a vector of 64 bytes.

SWAR as acronym for SIMD Within A Register was coined by Hank Dietz and Randell J. Fisher [2] . It is a processing model which applies SIMD parallel processing across sections of a CPU register, often vectors of smaller than byte-entities are processed in parallel prefix manner.

SIMD Instruction Sets

SWAR Arithmetic

To apply addition and subtraction on vectors of bit-aggregates or bit-field structures within a general purpose register, one has to take care carries and borrows don't wrap around. Thus the need to mask of all most significant bits (H) and add in two steps, one 'add' with MSB clear and one add modulo 2 aka 'xor' for the MSB itself. For bytewise (rankwise) math inside a 64-bit register, H is 0x8080808080808080 and L is 0x0101010101010101.

SWAR add z = x + y
    z = ((x &~H) + (y &~H)) ^ ((x ^ y) & H)
SWAR sub z = x - y
    z = ((x | H) - (y &~H)) ^ ((x ^~y) & H)
SWAR average z = (x+y)/2 based on x + y = (x^y) + 2*(x&y)
    z = (x & y) + (((x ^ y) & ~L) >> 1)

Samples

Amazing, how similar these two SWAR- and parallel prefix wise routines are. Mirror horizontally and population count have in common to act on vectors of duos, nibbles and bytes. One swaps bits, duos and nibbles, while the second adds populations of them.

U64 mirrorHorizontal (U64 x) {
    const U64 k1 = C64(0x5555555555555555);
    const U64 k2 = C64(0x3333333333333333);
    const U64 k4 = C64(0x0f0f0f0f0f0f0f0f);
    x = ((x & k1) << 1) | ((x >> 1)  & k1);
    x = ((x & k2) << 2) | ((x >> 2)  & k2);
    x = ((x & k4) << 4) | ((x >> 4)  & k4);
    return x;
}
int popCount (U64 x) {
    const U64 k1 = C64(0x5555555555555555);
    const U64 k2 = C64(0x3333333333333333);
    const U64 k4 = C64(0x0f0f0f0f0f0f0f0f);
    x =   x             - ((x >> 1)  & k1);
    x =  (x & k2)       + ((x >> 2)  & k2);
    x = ( x             +  (x >> 4)) & k4 ;
    x = (x * C64(0x0101010101010101))>> 56;
    return (int) x;
}

See also

Publications

1987 ...

2000 ...

2010 ...

Manuals

AMD

NXP Semiconductors

Intel

Forum Posts

1999

2000 ...

2010 ...

2020 ...

External Links

x86/x86-64

Other

Misc

References

Up one Level