MMX
MMX is a SIMD (single instruction, multiple data) instruction set for x86 processors, introduced in 1996 with Intel's Pentium MMX. In 1998, AMD enhanced Intel's MMX with the 3DNow! extension, which mainly adds packed single-precision floating-point operations. MMX instructions are available through assembly language, inline assembly, and C compiler intrinsics along with the __m64 intrinsic data type [2].
Register File
MMX uses eight 64-bit registers, MM0 through MM7, each treated as a vector of eight bytes, four words, two doublewords, or one quadword. The eight registers are aliased onto the existing x87 FPU stack registers and are therefore implicitly saved and restored during context switches by existing operating systems. The drawback is that it is somewhat difficult to work with x87 floating point and MMX data in the same application, since the original emms instruction, which must be executed to hand the register file back to the x87 FPU, was relatively slow.
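As a minimal sketch of these packed interpretations, the same pair of 64-bit values can be added as eight bytes, four words, or two doublewords. The helper names (to_m64, from_m64, add_as_bytes, add_as_words, add_as_dwords) are made up for this illustration; only the intrinsics come from mmintrin.h. The sketch assumes an x86 compiler that ships mmintrin.h, such as GCC or Clang; Microsoft's 64-bit compiler does not support the __m64 type.

```c
#include <mmintrin.h>   /* MMX intrinsics and the __m64 type */
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: move 64 bits into and out of an __m64 with memcpy,
   so the sketch does not depend on compiler-specific conversion intrinsics. */
static __m64 to_m64(uint64_t x)
{
    __m64 v;
    memcpy(&v, &x, sizeof v);
    return v;
}

static uint64_t from_m64(__m64 v)
{
    uint64_t x;
    memcpy(&x, &v, sizeof x);
    return x;
}

/* Eight independent 8-bit additions (paddb): carries stop at byte borders. */
uint64_t add_as_bytes(uint64_t a, uint64_t b)
{
    uint64_t r = from_m64(_mm_add_pi8(to_m64(a), to_m64(b)));
    _mm_empty();        /* emms: hand the aliased registers back to the x87 FPU */
    return r;
}

/* Four independent 16-bit additions (paddw). */
uint64_t add_as_words(uint64_t a, uint64_t b)
{
    uint64_t r = from_m64(_mm_add_pi16(to_m64(a), to_m64(b)));
    _mm_empty();
    return r;
}

/* Two independent 32-bit additions (paddd). */
uint64_t add_as_dwords(uint64_t a, uint64_t b)
{
    uint64_t r = from_m64(_mm_add_pi32(to_m64(a), to_m64(b)));
    _mm_empty();
    return r;
}
```

With a = 0x0000FFFF0000FFFF and b = 0x0000000100000001, the byte-wise add yields 0x0000FF000000FF00, the word-wise add yields 0, and the doubleword add yields 0x0001000000010000. Calling _mm_empty() after every operation is deliberately conservative; real code would issue it once, after the last MMX instruction and before any x87 floating-point code.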
MMX and 64-bit Windows
Since 64-bit Windows applications use SSE rather than x87 for floating-point arithmetic, there was some early confusion about whether the MMX/x87 registers are safe to use, that is, whether they are preserved across context switches. Quote from Agner Fog's Calling conventions manual [3]:
6.1 Can floating point registers be used in 64-bit Windows?
There has been widespread confusion about whether 64-bit Windows allows the use of the floating point registers ST(0)-ST(7) and the MM0 - MM7 registers that are aliased upon these. One early technical document found at Microsoft's website says "x87/MMX registers are unavailable to Native Windows64 applications" (Rich Brunner: Technical Details Of Microsoft® Windows® For The AMD64 Platform, Dec. 2003). An AMD document says: "64-bit Microsoft Windows does not strongly support MMX and 3Dnow! instruction sets in the 64-bit native mode" (Porting and Optimizing Multimedia Codecs for AMD64 architecture on Microsoft® Windows®, July 21, 2004). A document in Microsoft's MSDN says: "A caller must also handle the following issues when calling a callee: [...] Legacy Floating-Point Support: The MMX and floating-point stack registers (MM0-MM7/ST0-ST7) are volatile. That is, these legacy floating-point stack registers do not have their state preserved across context switches" (MSDN: Kernel-Mode Driver Architecture: Windows DDK: Other Calling Convention Process Issues. Preliminary, June 14, 2004; February 18, 2005).
This description is nonsense because it confuses saving registers across function calls and saving registers across context switches. Some versions of the Microsoft assembler ml64 (e.g. v. 8.00.40310) gives the following message when attempts are made to use floating point registers in 64 bit mode: "error A2222: x87 and MMX instructions disallowed; legacy FP state not saved in Win64". However, a public discussion forum quotes the following answers from Microsoft engineers regarding this issue: "From: Program Manager in Visual C++ Group, Sent: Thursday, May 26, 2005 10:38 AM. It does preserve the state. It's the DDK page that has stale information, which I've requested it to be changed. Let them know that the OS does preserve state of x87 and MMX registers on context switches." and "From: Software Engineer in Windows Kernel Group, Sent: Thursday, May 26, 2005 11:06 AM. For user threads the state of legacy floating point is preserved at context switch. But it is not true for kernel threads. Kernel mode drivers can not use legacy floating point instructions."
The issue has finally been resolved with the long overdue publication of a more detailed ABI for x64 Windows in the form of a document entitled "x64 Software Conventions", well hidden in the bin directory (not the help directory) of some compiler packages. This document says: "The MMX and floating-point stack registers (MM0-MM7/ST0-ST7) are preserved across context switches. There is no explicit calling convention for these registers. The use of these registers is strictly prohibited in kernel mode code." The same text has later appeared at the Microsoft website [4].
Applications
Almost the same bitboard applications as in the SSE2 application samples are possible with MMX, albeit with scalar bitboards rather than vectors of two.
East Fill
For instance, east attacks of rooks can be computed rank-wise (one byte per rank) with a SIMD-wise fill by subtraction:
__m64 eastAttacks (__m64 occ, __m64 rooks) {
   __m64 tmp;
   occ = _mm_or_si64 (occ, rooks); // make rooks member of occupied
   tmp = _mm_xor_si64(occ, rooks); // occ - rooks
   tmp = _mm_sub_pi8 (tmp, rooks); // occ - 2*rooks
   return _mm_xor_si64(occ, tmp);  // occ ^ (occ - 2*rooks)
}
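To see what the packed byte subtraction contributes, the routine above can be emulated in portable C. This is only an illustrative sketch: sub_bytes and east_attacks_scalar are hypothetical names, the loop stands in for the single psubb instruction behind _mm_sub_pi8, and the board mapping is assumed to put a1 at bit 0 with one rank per byte, so that "east" means towards higher bits within a byte.

```c
#include <stdint.h>

/* Subtract b from a in each of the eight bytes independently (modulo 256),
   emulating what the MMX psubb instruction does in one step. */
static uint64_t sub_bytes(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i += 8) {
        uint8_t d = (uint8_t)((a >> i) - (b >> i)); /* keep the low 8 bits only */
        r |= (uint64_t)d << i;
    }
    return r;
}

/* Scalar counterpart of eastAttacks: the borrow of the subtraction runs
   east from each rook until it hits the first blocker on the same rank. */
uint64_t east_attacks_scalar(uint64_t occ, uint64_t rooks)
{
    uint64_t tmp;
    occ |= rooks;                 /* make rooks member of occupied */
    tmp  = occ ^ rooks;           /* occ - rooks */
    tmp  = sub_bytes(tmp, rooks); /* occ - 2*rooks, byte-wise */
    return occ ^ tmp;             /* bits changed by the borrow */
}
```

Under the assumed mapping, a rook on a1 on an otherwise empty board gives 0xFE (b1 to h1), and with a blocker on e1 the result is 0x1E (b1 to e1, blocker included). Because the borrow never crosses a byte boundary, rooks on different ranks are handled in the same call.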
MMX Popcount
AMD proposed an efficient 64-bit population count using MMX, 3DNow! and 32-bit inline assembly [5]:
#include "amd3d.h"

__declspec (naked) unsigned int __stdcall popcount64 (unsigned __int64 v)
{
   static const __int64 C55 = 0x5555555555555555;
   static const __int64 C33 = 0x3333333333333333;
   static const __int64 C0F = 0x0F0F0F0F0F0F0F0F;
   __asm {
      MOVD      MM0, [ESP+4] ;v_low
      PUNPCKLDQ MM0, [ESP+8] ;v
      MOVQ      MM1, MM0     ;v
      PSRLD     MM0, 1       ;v >> 1
      PAND      MM0, [C55]   ;(v >> 1) & 0x55555555
      PSUBD     MM1, MM0     ;w = v - ((v >> 1) & 0x55555555)
      MOVQ      MM0, MM1     ;w
      PSRLD     MM1, 2       ;w >> 2
      PAND      MM0, [C33]   ;w & 0x33333333
      PAND      MM1, [C33]   ;(w >> 2) & 0x33333333
      PADDD     MM0, MM1     ;x = (w & 0x33333333) + ((w >> 2) & 0x33333333)
      MOVQ      MM1, MM0     ;x
      PSRLD     MM0, 4       ;x >> 4
      PADDD     MM0, MM1     ;x + (x >> 4)
      PAND      MM0, [C0F]   ;y = (x + (x >> 4)) & 0x0F0F0F0F
      PXOR      MM1, MM1     ;0
      PSADBW    MM0, MM1     ;sum across all 8 bytes
      MOVD      EAX, MM0     ;result in EAX per calling convention
      FEMMS                  ;clear MMX state
      RET       8            ;pop 8-byte argument off
   }
}
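The listing above depends on 32-bit inline assembly and the __stdcall stack layout, so it is tied to a 32-bit Microsoft-style compiler. As a hedged illustration of the same arithmetic, the steps can be written in portable C; popcount64_swar is a hypothetical name, the variables w, x and y mirror the assembly comments, and the final loop stands in for the PSADBW horizontal byte sum. This sketch is not AMD's code and makes no claim about relative speed.

```c
#include <stdint.h>

/* Portable sketch of the same bit-counting steps as the MMX listing. */
unsigned popcount64_swar(uint64_t v)
{
    const uint64_t C55 = 0x5555555555555555ULL;
    const uint64_t C33 = 0x3333333333333333ULL;
    const uint64_t C0F = 0x0F0F0F0F0F0F0F0FULL;

    uint64_t w = v - ((v >> 1) & C55);          /* 2-bit sums */
    uint64_t x = (w & C33) + ((w >> 2) & C33);  /* 4-bit sums */
    uint64_t y = (x + (x >> 4)) & C0F;          /* 8-bit sums, one per byte */

    /* PSADBW against zero adds up the eight bytes; emulate it with a loop. */
    unsigned sum = 0;
    for (int i = 0; i < 64; i += 8)
        sum += (unsigned)((y >> i) & 0xFF);
    return sum;
}
```

The 64-bit shifts give the same result as the per-doubleword PSRLD shifts in the assembly, because every bit that crosses the 32-bit boundary lands in a position that the following mask clears.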
Forum Posts
- Using mmx instructions by Frans Morsch, comp.lang.asm.x86, February 03, 2000
- Re: Atomic write of 64 bits by Frans Morsch, comp.lang.asm.x86, September 25, 2000
- Re: Chezzz 1.0.1 - problem solved - for David Rasmussen by David Rasmussen, CCC, February 05, 2003 » Population Count, Chezzz
References
- [1] Intel P5 (microarchitecture) from Wikipedia, Source: Sergei Frolov, Soviet Calculators Collection, September 2007
- [2] MMX Technology Intrinsic Groups
- [3] Calling conventions for different C++ compilers and operating systems (pdf) by Agner Fog
- [4] Legacy Floating-Point Support (C++) from MSDN Library
- [5] AMD Athlon Processor x86 Code Optimization Guide (pdf), Efficient 64-Bit Population Count Using MMX™ Instructions, page 184