X86-64
x86-64 or x64,
a 64-bit extension of x86, designed by AMD as the Hammer or K8 architecture and introduced with the Athlon 64 and Opteron CPUs. Intel cloned it under the name EM64T, later renamed Intel 64. Besides the 64-bit general-purpose extensions, x86-64 supports the MMX and x87 instruction sets as well as the 128-bit SSE and SSE2 instruction sets. Depending on the feature flags reported by the CPUID instruction, further SIMD streaming extensions may be available, such as SSE3, SSSE3 (initially Intel only), SSE4 (Core2, K10), AVX, AVX2 and AVX-512, as well as AMD's 3DNow!, Enhanced 3DNow! and XOP.
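Since the optional extensions depend on what CPUID reports, a program usually checks the feature flags once at startup. The following is a minimal sketch, not a complete detection routine: the struct `CpuFeatures` and the function `detect_cpu_features` are hypothetical names, and only a few flags of CPUID leaf 1 are decoded. It assumes either MSVC (`__cpuid` from `<intrin.h>`) or a GCC-compatible compiler (`__get_cpuid` from `<cpuid.h>`); on other targets all flags stay false.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#define X86_CPUID_MSVC 1
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define X86_CPUID_GNU 1
#endif

/* Hypothetical container for a few CPUID leaf 1 feature flags. */
typedef struct {
    bool sse2, sse3, ssse3, sse41, sse42, popcnt, avx;
} CpuFeatures;

/* Queries CPUID leaf 1 and decodes selected bits of ECX and EDX.
   Returns all-false on non-x86 targets or if leaf 1 is unavailable. */
static CpuFeatures detect_cpu_features(void)
{
    CpuFeatures f = { false, false, false, false, false, false, false };
    uint32_t ecx = 0, edx = 0;
#if defined(X86_CPUID_MSVC)
    int regs[4];                       /* EAX, EBX, ECX, EDX */
    __cpuid(regs, 1);
    ecx = (uint32_t)regs[2];
    edx = (uint32_t)regs[3];
#elif defined(X86_CPUID_GNU)
    unsigned int a = 0, b = 0, c = 0, d = 0;
    if (__get_cpuid(1, &a, &b, &c, &d)) {
        ecx = c;
        edx = d;
    }
#endif
    f.sse2   = (edx >> 26) & 1;
    f.sse3   = (ecx >>  0) & 1;
    f.ssse3  = (ecx >>  9) & 1;
    f.sse41  = (ecx >> 19) & 1;
    f.sse42  = (ecx >> 20) & 1;
    f.popcnt = (ecx >> 23) & 1;
    /* The AVX bit only says the CPU implements AVX; the operating system
       must also enable the YMM state, which this sketch does not check. */
    f.avx    = (ecx >> 28) & 1;
    return f;
}
```

Since SSE2 is part of the x86-64 baseline, `detect_cpu_features().sse2` should be true on any x86-64 processor, while the other flags vary with the CPU generation.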
Register File
x86-64 doubles the number of x86 general-purpose and XMM registers, from eight to sixteen each.
General Purpose
The 16 general purpose registers may be treated as 64 bit Quad Word (bitboard), 32 bit Double Word, 16 bit Word and high (partly), low Byte [2]:
64 | 32 | 16 | 8 high | 8 low | Purpose |
---|---|---|---|---|---|
RAX | EAX | AX | AH | AL | GP, Accumulator |
RBX | EBX | BX | BH | BL | GP, Index Register |
RCX | ECX | CX | CH | CL | GP, Counter, variable shift, rotate via CL |
RDX | EDX | DX | DH | DL | GP, high Accumulator mul/div |
RSI | ESI | SI | - | SIL | GP, Source Index |
RDI | EDI | DI | - | DIL | GP, Destination Index |
RSP | ESP | SP | - | SPL | Stack Pointer |
RBP | EBP | BP | - | BPL | GP, Base Pointer |
R8 | R8D | R8W | - | R8B | GP |
R.. | R..D | R..W | - | R..B | GP |
R15 | R15D | R15W | - | R15B | GP |
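The sub-register columns of the table can be mimicked in C by truncating a 64-bit value. This is only an illustrative sketch: the struct `RegisterViews` and the function `register_views` are hypothetical names, and the code models the values visible through, for instance, EAX, AX, AH and AL for a given RAX content rather than accessing real registers.

```c
#include <stdint.h>

/* Hypothetical record of the narrower views of one 64-bit register value. */
typedef struct {
    uint32_t dword;      /* bits 31..0, e.g. EAX */
    uint16_t word;       /* bits 15..0, e.g. AX  */
    uint8_t  high_byte;  /* bits 15..8, e.g. AH  */
    uint8_t  low_byte;   /* bits  7..0, e.g. AL  */
} RegisterViews;

/* Extracts the values seen through the sub-registers of a quad word. */
static RegisterViews register_views(uint64_t quad)
{
    RegisterViews v;
    v.dword     = (uint32_t)quad;
    v.word      = (uint16_t)quad;
    v.high_byte = (uint8_t)(quad >> 8);
    v.low_byte  = (uint8_t)quad;
    return v;
}
```

For the value 0x1122334455667788 the views are 0x55667788 (double word), 0x7788 (word), 0x77 (high byte) and 0x88 (low byte).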
MMX
Eight 64-bit MMX-Registers: MM0 - MM7. Treated as Quad Word, or as vector of two Floats or Double Words, vector of four Words or eight Bytes.
SSE/SSE*
Sixteen 128-bit XMM-Registers: XMM0 - XMM15. Treated as vector of two Doubles or Quad Words, as vector of four Floats or Double Words, and as vector of eight Words or 16 Bytes.
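The different integer lane widths of a 128-bit register can be pictured with a C union that overlays the same 16 bytes. This is a sketch of the memory layout only, assuming a little-endian target as on x86-64; the type `Xmm128View` and the function `xmm_view` are hypothetical names, and the float and double interpretations are left out.

```c
#include <stdint.h>

/* Hypothetical overlay: the same 16 bytes seen with four lane widths. */
typedef union {
    uint64_t quad[2];    /* two Quad Words    */
    uint32_t dword[4];   /* four Double Words */
    uint16_t word[8];    /* eight Words       */
    uint8_t  byte[16];   /* 16 Bytes          */
} Xmm128View;

/* Builds the overlay from two quad words; quad[0] is the low half. */
static Xmm128View xmm_view(uint64_t low, uint64_t high)
{
    Xmm128View v;
    v.quad[0] = low;
    v.quad[1] = high;
    return v;
}
```

On a little-endian machine, `xmm_view(0x0807060504030201, 0x100F0E0D0C0B0A09)` has `byte[0] == 0x01`, `word[0] == 0x0201` and `dword[1] == 0x08070605`, so lane 0 always covers the least significant bits.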
AVX, AVX2/XOP
Introduced with Intel Sandy Bridge and AMD Bulldozer: sixteen 256-bit YMM-Registers, YMM0 - YMM15 (the XMM registers are shared as their lower halves). Treated as vector of four Doubles or Quad Words, as vector of eight Floats or Double Words, and as vector of 16 Words or 32 Bytes.
AVX-512
Introduced with Intel Xeon Phi (2015): 32 512-bit ZMM-Registers, ZMM0 - ZMM31, and eight vector mask registers.
Instructions
Many instructions useful for bitboard applications are not directly expressible in standard high-level programming languages. They are available through (inline) Assembly or through the compiler intrinsics of various C-Compilers [3].
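As a rough illustration of how such instructions are typically reached from C, the sketch below wraps bit scan, population count and byte swap behind hypothetical helper names (`bit_scan_forward`, `bit_scan_reverse`, `pop_count`, `byte_swap`). It assumes the MSVC x64 intrinsics on MSVC and the `__builtin_*` functions on GCC-compatible compilers, with a plain C fallback otherwise. Whether a compiler actually emits bsf, bsr, popcnt or bswap depends on the target options.

```c
#include <assert.h>
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
#include <stdlib.h>              /* _byteswap_uint64 */
#define USE_MSVC_X64 1
#endif

/* Index (0..63) of the least significant one bit; bb must be non-zero. */
static inline int bit_scan_forward(uint64_t bb)
{
    assert(bb != 0);
#if defined(USE_MSVC_X64)
    unsigned long index;
    _BitScanForward64(&index, bb);
    return (int)index;
#elif defined(__GNUC__)
    return __builtin_ctzll(bb);
#else
    int index = 0;
    while ((bb & 1) == 0) { bb >>= 1; ++index; }
    return index;
#endif
}

/* Index (0..63) of the most significant one bit; bb must be non-zero. */
static inline int bit_scan_reverse(uint64_t bb)
{
    assert(bb != 0);
#if defined(USE_MSVC_X64)
    unsigned long index;
    _BitScanReverse64(&index, bb);
    return (int)index;
#elif defined(__GNUC__)
    return 63 - __builtin_clzll(bb);
#else
    int index = 0;
    while (bb >>= 1) ++index;
    return index;
#endif
}

/* Number of one bits. The MSVC intrinsic requires a CPU with popcnt. */
static inline int pop_count(uint64_t bb)
{
#if defined(USE_MSVC_X64)
    return (int)__popcnt64(bb);
#elif defined(__GNUC__)
    return __builtin_popcountll(bb);
#else
    int count = 0;
    for (; bb; bb &= bb - 1) ++count;   /* clears the lowest one bit */
    return count;
#endif
}

/* Reverses the order of the eight bytes. */
static inline uint64_t byte_swap(uint64_t x)
{
#if defined(USE_MSVC_X64)
    return _byteswap_uint64(x);
#elif defined(__GNUC__)
    return __builtin_bswap64(x);
#else
    uint64_t r = 0;
    for (int i = 0; i < 8; ++i) { r = (r << 8) | (x & 0xFF); x >>= 8; }
    return r;
#endif
}
```

The asserts document that both bit scans are undefined for an empty bitboard, which matches the behavior of the underlying instructions and builtins.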
General Purpose
x86-64 Instructions, C-Intrinsic reference from x64 (amd64) Intrinsics List | Microsoft Docs
Mnemonic | Description | C-Intrinsic | Remark |
---|---|---|---|
bsf | bit scan forward | _BitScanForward64 | |
bsr | bit scan reverse | _BitScanReverse64 | |
bswap | byte swap | _byteswap_uint64 | |
bt | bit test | _bittest64 | |
btc | bit test and complement | _bittestandcomplement64 | |
btr | bit test and reset | _bittestandreset64 | |
bts | bit test and set | _bittestandset64 | |
cpuid | cpuid | __cpuid | cpuid |
imul | signed multiplication | __mulh, _mul128 | |
lzcnt | leading zero count | __lzcnt16, __lzcnt, __lzcnt64 | cpuid, SSE4a |
mul | unsigned multiplication | __umulh, _umul128 | |
popcnt | population count | __popcnt16, __popcnt, __popcnt64 | cpuid, SSE4.2, SSE4a |
rdtsc | read time-stamp counter | __rdtsc | |
rol, ror | rotate left, right | _rotl, _rotl64, _rotr, _rotr64 | |
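The multiplication and rotate entries of the table can be sketched in the same portable style. The helper names `mul_high_u64` and `rotate_left_u64` are hypothetical. The code assumes `__umulh` on MSVC x64 and `unsigned __int128` on GCC-compatible 64-bit compilers, and otherwise falls back to 32-bit partial products.

```c
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
#endif

/* Upper 64 bits of the unsigned 128-bit product a * b. */
static inline uint64_t mul_high_u64(uint64_t a, uint64_t b)
{
#if defined(_MSC_VER) && defined(_M_X64)
    return __umulh(a, b);
#elif defined(__SIZEOF_INT128__)
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
#else
    /* Schoolbook multiplication with four 32x32 -> 64 bit products. */
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;
    /* Sum of the middle column; cannot overflow 64 bits. */
    uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;
    return hi_hi + (hi_lo >> 32) + (cross >> 32);
#endif
}

/* Rotates x left by s bits; s is taken modulo 64.
   Compilers commonly recognize this pattern and emit rol. */
static inline uint64_t rotate_left_u64(uint64_t x, unsigned s)
{
    s &= 63;
    return (x << s) | (x >> ((64 - s) & 63));
}
```

The masking in `rotate_left_u64` avoids a shift by 64, which would be undefined behavior in C when `s` is zero.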
Bit-Manipulation
SSE2
x86 and x86-64 - SSE2 Instructions, C-Intrinsic reference from Intel Intrinsics Guide
Mnemonic | Description | Return | C-Intrinsic | Parameter |
---|---|---|---|---|
bitwise logical | | | | |
pand | packed and, r := a & b | __m128i | _mm_and_si128 | (__m128i a, __m128i b) |
pandn | packed and not, r := ~a & b | __m128i | _mm_andnot_si128 | (__m128i a, __m128i b) |
por | packed or, r := a \| b | __m128i | _mm_or_si128 | (__m128i a, __m128i b) |
pxor | packed xor, r := a ^ b | __m128i | _mm_xor_si128 | (__m128i a, __m128i b) |
quad word shifts | | | | |
psrlq | packed shift right logical quad | __m128i | _mm_srl_epi64 | (__m128i a, __m128i cnt) |
psrlq | packed shift right logical quad, immediate | __m128i | _mm_srli_epi64 | (__m128i a, int cnt) |
psllq | packed shift left logical quad | __m128i | _mm_sll_epi64 | (__m128i a, __m128i cnt) |
psllq | packed shift left logical quad, immediate | __m128i | _mm_slli_epi64 | (__m128i a, int cnt) |
arithmetical | | | | |
paddb | packed add bytes | __m128i | _mm_add_epi8 | (__m128i a, __m128i b) |
psubb | packed subtract bytes | __m128i | _mm_sub_epi8 | (__m128i a, __m128i b) |
psadbw | packed sum of absolute differences of bytes into a word | __m128i | _mm_sad_epu8 | (__m128i a, __m128i b) |
pmaxsw | packed maximum signed words | __m128i | _mm_max_epi16 | (__m128i a, __m128i b) |
pmaxub | packed maximum unsigned bytes | __m128i | _mm_max_epu8 | (__m128i a, __m128i b) |
pminsw | packed minimum signed words | __m128i | _mm_min_epi16 | (__m128i a, __m128i b) |
pminub | packed minimum unsigned bytes | __m128i | _mm_min_epu8 | (__m128i a, __m128i b) |
pcmpeqb | packed compare equal bytes | __m128i | _mm_cmpeq_epi8 | (__m128i a, __m128i b) |
pmullw | packed multiply low signed (unsigned) word | __m128i | _mm_mullo_epi16 | (__m128i a, __m128i b) |
pmulhw | packed multiply high signed word | __m128i | _mm_mulhi_epi16 | (__m128i a, __m128i b) |
pmulhuw | packed multiply high unsigned word | __m128i | _mm_mulhi_epu16 | (__m128i a, __m128i b) |
pmaddwd | packed multiply words and add doublewords | __m128i | _mm_madd_epi16 | (__m128i a, __m128i b) |
unpack, shuffle | | | | |
punpcklbw | unpack and interleave low bytes, gGhHfFeE:dDcCbBaA := xxxxxxxx:GHFEDCBA # xxxxxxxx:ghfedcba | __m128i | _mm_unpacklo_epi8 | (__m128i A, __m128i a) |
punpckhbw | unpack and interleave high bytes, gGhHfFeE:dDcCbBaA := GHFEDCBA:xxxxxxxx # ghfedcba:xxxxxxxx | __m128i | _mm_unpackhi_epi8 | (__m128i A, __m128i a) |
punpcklwd | unpack and interleave low words, dDcC:bBaA := xxxx:DCBA # xxxx:dcba | __m128i | _mm_unpacklo_epi16 | (__m128i A, __m128i a) |
punpckhwd | unpack and interleave high words, dDcC:bBaA := DCBA:xxxx # dcba:xxxx | __m128i | _mm_unpackhi_epi16 | (__m128i A, __m128i a) |
punpckldq | unpack and interleave low doublewords, bB:aA := xx:BA # xx:ba | __m128i | _mm_unpacklo_epi32 | (__m128i A, __m128i a) |
punpckhdq | unpack and interleave high doublewords, bB:aA := BA:xx # ba:xx | __m128i | _mm_unpackhi_epi32 | (__m128i A, __m128i a) |
punpcklqdq | unpack and interleave low quadwords, a:A := x:A # x:a | __m128i | _mm_unpacklo_epi64 | (__m128i A, __m128i a) |
punpckhqdq | unpack and interleave high quadwords, a:A := A:x # a:x | __m128i | _mm_unpackhi_epi64 | (__m128i A, __m128i a) |
pshuflw | packed shuffle low words | __m128i | _mm_shufflelo_epi16 | (__m128i a, int imm) |
pshufhw | packed shuffle high words | __m128i | _mm_shufflehi_epi16 | (__m128i a, int imm) |
pshufd | packed shuffle doublewords | __m128i | _mm_shuffle_epi32 | (__m128i a, int imm) |
load, store, moves | | | | |
movdqa | move aligned double quadword, xmm := *p | __m128i | _mm_load_si128 | (__m128i const *p) |
movdqu | move unaligned double quadword, xmm := *p | __m128i | _mm_loadu_si128 | (__m128i const *p) |
movdqa | move aligned double quadword, *p := xmm | void | _mm_store_si128 | (__m128i *p, __m128i a) |
movdqu | move unaligned double quadword, *p := xmm | void | _mm_storeu_si128 | (__m128i *p, __m128i a) |
movq | move quadword, xmm := gp64 | __m128i | _mm_cvtsi64_si128 | (__int64 a) |
movq | move quadword, gp64 := xmm | __int64 | _mm_cvtsi128_si64 | (__m128i a) |
movd | move doubleword or quadword, xmm := gp64 | __m128i | _mm_cvtsi64x_si128 | (__int64 value) |
movd | move doubleword, xmm := gp32 | __m128i | _mm_cvtsi32_si128 | (int a) |
movd | move doubleword, gp32 := xmm | int | _mm_cvtsi128_si32 | (__m128i a) |
pextrw | extract packed word, gp16 := xmm[i] | int | _mm_extract_epi16 | (__m128i a, int imm) |
pinsrw | packed insert word, xmm[i] := gp16 | __m128i | _mm_insert_epi16 | (__m128i a, int b, int imm) |
pmovmskb | packed move mask byte, gp32 := 16 sign-bits(xmm) | int | _mm_movemask_epi8 | (__m128i a) |
cache support | | | | |
prefetch | | void | _mm_prefetch | (char const *p, int i) |
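A short sketch of how some of these intrinsics combine, assuming a compiler that provides `<emmintrin.h>`. The helper names `shift_left_and_mask_x2` and `zero_byte_mask16` are hypothetical, and a scalar fallback is included so the code also builds on targets without SSE2. The first function processes two quad words (for instance two bitboards) at once with psllq and pand, the second uses pcmpeqb and pmovmskb to locate zero bytes.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>           /* SSE2 intrinsics */
#define HAVE_SSE2 1
#endif

/* out[i] = (in[i] << count) & mask for both quad words.
   Counts above 63 yield zero, as psllq does. */
static void shift_left_and_mask_x2(const uint64_t in[2], unsigned count,
                                   uint64_t mask, uint64_t out[2])
{
#ifdef HAVE_SSE2
    __m128i v   = _mm_loadu_si128((const __m128i *)in);   /* movdqu */
    __m128i cnt = _mm_cvtsi32_si128((int)count);          /* movd   */
    /* The cast only reinterprets the bit pattern of the mask. */
    __m128i m   = _mm_set1_epi64x((long long)mask);
    v = _mm_sll_epi64(v, cnt);                            /* psllq  */
    v = _mm_and_si128(v, m);                              /* pand   */
    _mm_storeu_si128((__m128i *)out, v);                  /* movdqu */
#else
    for (int i = 0; i < 2; ++i)
        out[i] = count < 64 ? (in[i] << count) & mask : 0;
#endif
}

/* Returns a 16-bit mask with bit i set when bytes[i] is zero. */
static int zero_byte_mask16(const uint8_t bytes[16])
{
#ifdef HAVE_SSE2
    __m128i v  = _mm_loadu_si128((const __m128i *)bytes); /* movdqu   */
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());  /* pcmpeqb  */
    return _mm_movemask_epi8(eq);                         /* pmovmskb */
#else
    int mask = 0;
    for (int i = 0; i < 16; ++i)
        if (bytes[i] == 0)
            mask |= 1 << i;
    return mask;
#endif
}
```

The unaligned load and store variants are used so that callers need not align their arrays to 16 bytes. With aligned data, `_mm_load_si128` and `_mm_store_si128` could be used instead.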
Software
Operating Systems
Development
Assembly
C-Compiler
See also
Publications
- Georg Hager [5], Jan Treibig, Gerhard Wellein (2013). The Practitioner's Cookbook for Good Parallel Performance on Multi- and Many-Core Systems. RRZE, SC13, slides as pdf
- S. Ali Mirsoleimani, Aske Plaat, Jaap van den Herik, Jos Vermaseren (2014). Performance analysis of a 240 thread tournament level MCTS Go program on the Intel Xeon Phi. CoRR abs/1409.4297 » Go
- S. Ali Mirsoleimani, Aske Plaat, Jaap van den Herik, Jos Vermaseren (2015). Scaling Monte Carlo Tree Search on Intel Xeon Phi. CoRR abs/1507.04383 » Hex, MCTS, Parallel Search
Manuals
Agner Fog
AMD
Instructions
- Volume 1: Application Programming (pdf)
- Volume 2: System Programming (pdf)
- Volume 3: General-Purpose and System Instructions (pdf)
- Volume 4: 128-Bit and 256-Bit Media Instructions (pdf)
- Volume 5: 64-Bit Media and x87 Floating-Point Instructions (pdf)
Optimization Guides
- Software Optimization Guide for AMD64 Processors (pdf)
- Software Optimization Guide for AMD Family 15h Processors (pdf)
- Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA Multiprocessor Systems (pdf)
Intel
Instructions
- Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2A: Instruction Set Reference, A-M (pdf)
- Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2B: Instruction Set Reference, N-Z (pdf)
- Intel-AVX-Programming-Reference (pdf)
Optimization Guides
Forum Posts
2003 ...
- IA-64 vs OOOE (attn Taylor, Hyatt) by Tom Kerrigan, CCC, February 11, 2003 » Itanium
- Opteron NUMA/SMP question by Matthew Hull, CCC, February 09, 2005 » NUMA, SMP
- core2 popcnt by Frank Phillips, CCC, February 13, 2009 » Population Count
2010 ...
- Ivy Bridge vs Sandy Bridge for computer chess by Larry Kaufman, CCC, September 15, 2012
- What is your take on AMD's new processor? by Tano-Urayoan Russi Roman, CCC, October 24, 2012
- Intel i3 L2 cache by Harm Geert Muller, CCC, January 28, 2014 » Memory [6]
- Core Port Saturation by Natale Galioto, CCC, April 14, 2014
2015 ...
- syzygy users (and Ronald) by Robert Hyatt, CCC, September 29, 2016 » BitScan, Population Count
- New AMD processors by Ingo Althöfer, The Computer-go Archives, March 03, 2017
- Ryzen and BMI2: Strange behavior and high latencies by DonnieTinyHands, Reddit, March 20, 2017 » AMD, BMI2
- Is anyone here already using a Ryzen 1800X processor ? by Aloisio Ponti, CCC, March 26, 2017 » AMD
- Intel CPU performance-loss by security-patch?!? by Stefan Pohl, CCC, January 03, 2018
- Re: Komodo 11.3 by Mark Lefler, CCC, March 04, 2018 » AMD, BMI2 PEXT, Komodo 11.3
- Some x64 assembler for the curious by Michael Sherwin, CCC, March 22, 2019 » Assembly
- Ryzen problems - AGAIN! by noobpwnftw, CCC, October 22, 2019
2020 ...
- Intel AMX with TMUL on Xeon Sapphire Rapids (2021?) by Srdja Matovic, CCC, July 05, 2020 » AMX
- Can somebody compare the AMD Ryzen processors to the intel processors by George Pichard, CCC, March 24, 2021
External Links
- x86-64 from Wikipedia
- x86-64 calling conventions from Wikipedia
- x86 Addressing modes from Wikipedia
- X32 ABI from Wikipedia [7]
- Stack frame layout on x86-64 from Eli Bendersky's website, September 06, 2011 » Stack
- Introduction to x64 Assembly by Chris Lomont, March 2012
AMD
- List of AMD CPU microarchitectures from Wikipedia
- AMD K8 from Wikipedia
- Athlon 64
- Athlon 64 FX
- Opteron
- Athlon 64 X2 dual-core
- Turion 64 X2 dual-core
- Inside AMD's Hammer: the 64-bit architecture behind the Opteron and Athlon 64 by Jon Stokes, ars technica, February 01, 2005
- Understanding the detailed Architecture of AMD's 64 bit Core by Hans de Vries, September 21, 2003
- AMD K8 from 7-Zip LZMA Benchmark
- AMD K9 from Wikipedia
- AMD 10h from Wikipedia
- AMD K10 (Phenom) from 7-Zip LZMA Benchmark
- Phenom triple-core, quad-core
- Bobcat (microarchitecture) from Wikipedia
- Bulldozer (microarchitecture) from Wikipedia
- Piledriver (microarchitecture) from Wikipedia
- Steamroller (microarchitecture) from Wikipedia
- Excavator (microarchitecture) from Wikipedia
- Zen (microarchitecture) from Wikipedia
- Zen (first generation microarchitecture) from Wikipedia
- Zen+ from Wikipedia
- Zen 2 from Wikipedia
- Zen 3 from Wikipedia
- Zen 4 from Wikipedia
Intel
- List of Intel CPU microarchitectures from Wikipedia
- EM64T from Wikipedia
- Tick-Tock model from Wikipedia
- Intel Core (microarchitecture) from Wikipedia
- Intel Atom from Wikipedia
- Nehalem (microarchitecture) from Wikipedia
- Sandy Bridge (microarchitecture) from Wikipedia
- Intel Sandy Bridge from 7-Zip LZMA Benchmark
- Ivy Bridge (microarchitecture) from Wikipedia
- Intel Ivy Bridge from 7-Zip LZMA Benchmark
- Haswell (microarchitecture) from Wikipedia
- Intel Haswell from 7-Zip LZMA Benchmark
- Intel's Haswell CPU Microarchitecture by David Kanter, November 13, 2012
- Broadwell (microarchitecture) from Wikipedia
- Skylake (microarchitecture) from Wikipedia
- Kaby Lake from Wikipedia
- Xeon Phi from Wikipedia
Instruction Sets
- x87 from Wikipedia
- MMX from Wikipedia
- 3DNow! from Wikipedia
- Streaming SIMD Extensions from Wikipedia
- SSE2 from Wikipedia » SSE2
- SSE3 from Wikipedia » SSE3
- SSSE3 from Wikipedia » SSSE3
- SSE4 from Wikipedia » SSE4
- SSE4a from Wikipedia
- SSE5 from Wikipedia » SSE5
- XOP instruction set from Wikipedia » XOP
- Advanced Vector Extensions (AVX) from Wikipedia » AVX
- Transactional Synchronization Extensions (TSX) from Wikipedia (Haswell)
- Intel Intrinsics Guide
- Advanced Matrix Extension (AMX) - x86 - WikiChip
- Bit manipulation instruction set from Wikipedia
Security Vulnerability
- Meltdown (security vulnerability) from Wikipedia
- Spectre (security vulnerability) from Wikipedia
- Project Zero: Reading privileged memory with a side-channel by Jann Horn, Project Zero, January 03, 2018
References
- ↑ Die shot of AMD Opteron quad-core processor, Wikimedia Commons
- ↑ Introduction to x64 Assembly | Intel® Software
- ↑ Intel(R) C++ Compiler User and Reference Guides covers Intrinsics
- ↑ Advanced Matrix Extension (AMX) - x86 - WikiChip
- ↑ Georg Hager's Blog | Random thoughts on High Performance Computing
- ↑ Intel Nehalem Core i3
- ↑ Application binary interface from Wikipedia