Revision as of 20:58, 19 July 2019
Shared Hash Table,
a Hash table or Transposition table which is accessed simultaneously by multiple processes or threads, running on multiple processors or processor cores. Shared hash tables are most often implemented as dynamically allocated memory treated as a global array. Due to memory protection between processes, they require an Application programming interface provided by the operating system to allocate shared memory. Threads may share the global memory of the process they belong to.
Parallel Search
Almost all parallel search algorithms on SMP or NUMA systems profit from probing hash entries written by other instances of the search, in the simplest form by instances of a sequential search algorithm which simultaneously search the same root position. The gains come from the effects of nondeterminism: each processor finishes its various subtrees in varying amounts of time, and as the search continues, these effects grow and make the search trees diverge. The speedup then depends on how many nodes the main processor is able to skip thanks to transposition table entries. This approach had the reputation of yielding little speedup on as few as 2 processors and of scaling quite badly beyond that; the NPS scaling, however, is nearly perfect.
ABDADA
see Main page: ABDADA
ABDADA, Alpha-Bêta Distribué avec Droit d'Aînesse (Distributed Alpha-Beta Search with Eldest Son Right), is a loosely synchronized, distributed search algorithm by Jean-Christophe Weill [2]. It is based on the Shared Hash Table and stores the number of processors currently searching a node inside its hash-table entry for better utilization, in consideration of the Young Brothers Wait Concept.
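The extra bookkeeping can be sketched as follows; the entry layout and field names are hypothetical, and a real implementation would need atomic increments and decrements:

```c
#include <stdint.h>

/* Hypothetical ABDADA-style hash entry: besides key, score and draft,
   it keeps nproc, the number of searchers currently inside this node. */
typedef struct {
    uint64_t key;
    int16_t  score;
    uint8_t  draft;
    uint8_t  nproc;   /* searchers currently working on this position */
} ttentry;

/* On entering a node a searcher increments nproc, on leaving it
   decrements it. Other searchers seeing nproc > 0 may defer the node
   (after the eldest son), in the spirit of Young Brothers Wait. */
void enter_node(ttentry *e) { e->nproc++; }
void leave_node(ttentry *e) { e->nproc--; }

int is_busy(const ttentry *e) { return e->nproc > 0; }
```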
Lazy SMP
see Main page: Lazy SMP
Recent improvements by Daniel Homan [3], Martin Sedlak [4] and others on Lazy SMP indicate that the algorithm scales quite well up to 8 cores and beyond [5].
Concurrent Access
Due to their size of 16 or more bytes, hash entries cannot be written or read atomically; they require multiple write and read cycles. It may and will happen that concurrent writes and reads of the same table address at almost the same time result in corrupted data being retrieved, which causes significant problems for the search. Interrupts may occur between accesses, and there are further nondeterministic issues involved [6] which may cause one thread to read two or more atomic data items that were written by different threads searching different positions with the same hash index due to type-2 errors.
Locks
One common solution to avoid such errors is synchronization using atomic locks to implement a critical section or mutual exclusion.
CilkChess
As an example, CilkChess used Cilk's support for atomicity [7], with one lock per hash entry:

 typedef struct {
    Cilk_lockvar lock;
    U64 key;
    U64 data;
 } ttentry;

 ttentry hashtable[TABLESIZE];

 void ttinit() {
    for (int i = 0; i < TABLESIZE; ++i)
       Cilk_lock_init(hashtable[i].lock);
 }

 void update_entry(ttentry *e, U64 key, U64 data) {
    Cilk_lock(e->lock);    /* begin critical section */
    e->key = key;
    e->data = data;
    ...
    Cilk_unlock(e->lock);  /* end critical section */
 }
Granularity
An important property of a lock is its granularity, a measure of the amount of data the lock is protecting. In general, choosing a coarse granularity (a small number of locks, each protecting a large segment of data) results in less lock overhead when a single process is accessing the protected data, but worse performance when multiple processes are running concurrently, because of increased lock contention. The coarser the lock, the higher the likelihood that it will stop an unrelated process from proceeding; the extreme case is one lock for the whole table. Conversely, using a fine granularity (a larger number of locks, each protecting a fairly small amount of data), as in the CilkChess sample above, increases the overhead of the locks themselves but reduces lock contention. For a huge transposition table with millions of fairly small entries, locks incur a significant performance penalty on many architectures.
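A middle ground between the two extremes is lock striping: a modest number of locks, each guarding a region of entries. A minimal sketch in C11; TABLESIZE, NLOCKS, and the entry layout are illustrative assumptions, and spinlocks stand in for whatever lock primitive the platform provides:

```c
#include <stdatomic.h>
#include <stdint.h>

#define TABLESIZE (1 << 16)   /* illustrative sizes, powers of two */
#define NLOCKS    256

typedef struct { uint64_t key, data; } ttentry;

static ttentry     hashtable[TABLESIZE];
static atomic_flag locks[NLOCKS];

/* Each spinlock guards TABLESIZE / NLOCKS entries: raising NLOCKS
   lowers contention, lowering it reduces lock overhead and footprint. */
void locks_init(void) {
    for (int i = 0; i < NLOCKS; ++i)
        atomic_flag_clear(&locks[i]);
}

static void lock_region(uint64_t index) {
    while (atomic_flag_test_and_set(&locks[index & (NLOCKS - 1)]))
        ;   /* spin until the flag was clear */
}

static void unlock_region(uint64_t index) {
    atomic_flag_clear(&locks[index & (NLOCKS - 1)]);
}

void store_entry(uint64_t key, uint64_t data) {
    uint64_t index = key & (TABLESIZE - 1);
    lock_region(index);
    hashtable[index].key  = key;
    hashtable[index].data = data;
    unlock_region(index);
}

int probe_entry(uint64_t key, uint64_t *data) {
    uint64_t index = key & (TABLESIZE - 1);
    int hit = 0;
    lock_region(index);
    if (hashtable[index].key == key) {
        *data = hashtable[index].data;
        hit = 1;
    }
    unlock_region(index);
    return hit;
}
```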
Lock-less
Xor
Robert Hyatt and Tim Mann proposed a lock-less transposition table implementation [8] for 128-bit entries consisting of two atomic quad words: one qword for the key or signature, the 64-bit Zobrist- or BCH-key of the position, and one qword for the other information stored, such as move, score and draft (data). Rather than storing the two items disjointly, the key is stored xored with the data, while the data is stored as usual. According to Robert Hyatt, the original idea came from Harry Nelson somewhere in 1990-1992 [9].
 index = key % TABLESIZE;
 hashtable[index].key  = key ^ data;
 hashtable[index].data = data;
Since a probing hit requires the retrieving position to have the same key, the stored key xored with the stored data must match the probing key:
 index = key % TABLESIZE;
 if ((hashtable[index].key ^ hashtable[index].data) == key) {
    /* entry matches key */
 }
If key and data were written simultaneously by different search instances with different keys, the error will usually result in a mismatch in this comparison, except for the rare but inherent Key collisions or type-1 errors [10]. As pointed out by Harm Geert Muller [11], the XOR technique may be applied for entries of any size.
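Put together, the store and probe above can be sketched as a self-contained fragment; the table size and entry layout are illustrative:

```c
#include <stdint.h>

#define TABLESIZE 4096   /* illustrative size, a power of two */

typedef struct { uint64_t key, data; } ttentry;
ttentry hashtable[TABLESIZE];

/* Store: the key slot holds key ^ data, so a torn write (key from one
   position, data from another) no longer validates on probe. */
void tt_store(uint64_t key, uint64_t data) {
    uint64_t index = key & (TABLESIZE - 1);
    hashtable[index].key  = key ^ data;
    hashtable[index].data = data;
}

/* Probe: a hit requires the xored key slot to match the probing key. */
int tt_probe(uint64_t key, uint64_t *data) {
    uint64_t index = key & (TABLESIZE - 1);
    if ((hashtable[index].key ^ hashtable[index].data) == key) {
        *data = hashtable[index].data;
        return 1;
    }
    return 0;
}
```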
Checksum
For a lock-less shared Hash table with (much) larger entry sizes, such as the Pawn Hash Table, one may store an additional checksum of the data to detect most errors after retrieving and to preserve the consistency of an entry.
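As a sketch, assuming a hypothetical pawn-hash entry layout, a simple xor-fold of the payload can serve as such a checksum; note that a plain xor-fold is weak, and a real implementation might prefer a stronger hash:

```c
#include <stdint.h>

/* Hypothetical pawn hash entry: key, a checksum, and a larger payload. */
typedef struct {
    uint64_t key;
    uint64_t checksum;
    uint64_t payload[6];   /* scores, pawn spans, passer sets, ... */
} pawnentry;

/* xor-fold of the payload; any single torn word flips the checksum. */
uint64_t pawn_checksum(const pawnentry *e) {
    uint64_t sum = 0;
    for (int i = 0; i < 6; ++i)
        sum ^= e->payload[i];
    return sum;
}

void pawn_store(pawnentry *e, uint64_t key, const uint64_t payload[6]) {
    e->key = key;
    for (int i = 0; i < 6; ++i)
        e->payload[i] = payload[i];
    e->checksum = pawn_checksum(e);
}

/* A probe only counts as a hit if key and checksum both match. */
int pawn_probe(const pawnentry *e, uint64_t key) {
    return e->key == key && e->checksum == pawn_checksum(e);
}
```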
SSE2
x86 and x86-64 SSE2 128-bit read/write instructions might be atomic in practice, but this is not guaranteed even if they are properly aligned [12] [13]. If the processor implements a 16-byte store internally as two 8-byte stores in the store pipeline, it is perfectly possible for another processor to "steal" the cache line between the two stores [14]. However, Intel states that any locked instruction (either the XCHG instruction or another read-modify-write instruction with a LOCK prefix) appears to execute as an indivisible and uninterruptible sequence of load(s) followed by store(s), regardless of alignment [15] [16].
Allocation
Multiple threads inside one process share its global variables and heap. Processes require special API calls to create shared memory and to pass a handle around to other processes for interprocess communication. POSIX provides a standardized API for using shared memory [17]. Linux kernel builds since 2.6 offer /dev/shm as shared memory in the form of a RAM disk.
See also
Publications
1980 ...
- Clyde Kruskal, Larry Rudolph, Marc Snir (1988). Efficient Synchronization on Multiprocessors with Shared Memory. ACM TOPLAS, Vol. 10
- Henri Bal (1989). The shared data-object model as a paradigm for programming distributed systems. Ph.D. thesis, Vrije Universiteit
1990 ...
- Maurice Herlihy (1991). Wait-free synchronization. ACM Transactions on Programming Languages and Systems Vol. 13 No. 1, pdf
- Vincent David (1993). Algorithmique parallèle sur les arbres de décision et raisonnement en temps contraint. Etude et application au Minimax = Parallel algorithm for heuristic tree searching and real-time reasoning. Study and application to the Minimax, Ph.D. Thesis, École nationale supérieure de l'aéronautique et de l'espace, Toulouse, France
- Abstract: The method of parallelization is based on a suppression of control between the search processes, in favor of a speculative parallelism and full sharing of information achieved through a physically distributed but virtually shared memory. The contribution of our approach for real-time distributed systems and fault-tolerant is evaluated through experimental results.
- Maged M. Michael, Michael L. Scott (1995). Implementation of Atomic Primitives on Distributed Shared Memory Multiprocessors. HPCA'95, pdf
- Paul Lu (1997). Aurora: Scoped Behaviour for Per-Context Optimized Distributed Data Sharing. 11th International Parallel Processing Symposium (IPPS) [19]
- John Romein, Aske Plaat, Henri Bal, Jonathan Schaeffer (1999). Transposition Table Driven Work Scheduling in Distributed Search. AAAI-99, pdf [20] [21]
2000 ...
- Paul Lu (2000). Scoped Behaviour for Optimized Distributed Data Sharing. Ph.D. thesis, University of Toronto
- Valavan Manohararajah (2001). Parallel Alpha-Beta Search on Shared Memory Multiprocessors. Masters Thesis, pdf
- John Romein, Henri Bal, Jonathan Schaeffer, Aske Plaat (2002). A Performance Analysis of Transposition-Table-Driven Scheduling in Distributed Search. IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 5, pp. 447–459. pdf [22]
- Maged M. Michael (2002). High Performance Dynamic Lock-Free Hash Tables and List-Based Sets. IBM Thomas J. Watson Research Center, pdf
- Robert Hyatt, Tim Mann (2002). A lock-less transposition table implementation for parallel search chess engines. ICGA Journal, Vol. 25, No. 1
- Jiří Barnat, Petr Ročkai (2007). Shared Hash Tables in Parallel Model Checking. Faculty of Informatics, Masaryk University, Brno, Czech Republic, PDMC 2007, Preliminary Version as pdf
- Vivek Sarkar (2008). Shared-Memory Parallel Programming with OpenMP. Rice University, slides as pdf
- Jouni Leppäjärvi (2008). A pragmatic, historically oriented survey on the universality of synchronization primitives. pdf
2010 ...
- Alfons Laarman, Jaco van de Pol, Michael Weber (2010). Boosting Multi-Core Reachability Performance with Shared Hash Tables. Formal Methods and Tools, University of Twente, The Netherlands, pdf
- John Mellor-Crummey (2011). Shared-memory Parallel Programming with Cilk. Rice University, slides as pdf » Cilk
- Anthony Williams (2012). C++ Concurrency in Action: Practical Multithreading. [23]
Forum Posts
1997 ...
- Parallel searching by Andrew Tridgell, rgcc, March 22, 1997 » KnightCap
- CilkChess question for Don by Robert Hyatt, CCC, January 31, 1999 » CilkChess
2000 ...
- Re: Atomic write of 64 bits by Frans Morsch, comp.lang.asm.x86, September 25, 2000
2005 ...
- multithreading questions by Martin Fierz, CCC, August 08, 2007
- If making an SMP engine, do NOT use processes by Zach Wegner, CCC, February 07, 2008
- threads vs processes by Robert Hyatt, CCC, July 16, 2008
- threads vs processes again by Robert Hyatt, CCC, August 05, 2008
- SMP hashing problem by Robert Hyatt, CCC, January 24, 2009
- Interlock clusters by Steven Edwards, CCC, January 25, 2009
2010 ...
- lockless hashing by Daniel Shawul, CCC, February 07, 2011
- On parallelization by Onno Garms, CCC, March 13, 2011
- cache alignment of tt by Daniel Shawul, CCC, March 11, 2012
- Speaking of the hash table by Ed Schroder, CCC, December 09, 2012
- Lazy SMP by Julien Marcel, CCC, December 27, 2012 » Lazy SMP
- Lazy SMP, part 2 by Daniel Homan, CCC, January 12, 2013
- Multi-threaded memory access by ThinkingALot, OpenChess Forum, February 10, 2013 » Memory, Thread
- Lazy SMP, part 3 by Daniel Homan, CCC, March 09, 2013
- Shared hash table smp result by Daniel Shawul, CCC, March 21, 2013
- Transposition driven scheduling by Daniel Shawul, CCC, April 04, 2013 [24]
- Lazy SMP and Work Sharing by Daniel Homan, CCC, July 03, 2013 » Lazy SMP in EXchess
- Lockless hash: Thank you Bob Hyatt! by Julien Marcel, CCC, August 25, 2013
- How could a compiler break the lockless hashing method? by Rein Halbersma, CCC, December 08, 2013
- Parallel Search with Transposition Table by Daylen Yang, CCC, March 27, 2014 » Parallel Search
- Two hash functions for distributed transposition table by Daniel Shawul, CCC, December 16, 2014
2015 ...
- Lazy SMP in Cheng by Martin Sedlak, CCC, February 02, 2015 » Cheng
- Trying to improve lazy smp by Daniel José Queraltó, CCC, April 11, 2015
- lazy smp questions by Lucas Braesch, CCC, September 09, 2015 » Lazy SMP
- atomic TT by Lucas Braesch, CCC, September 13, 2015
- lazy smp questions by Marco Belli, CCC, December 21, 2015 » Lazy SMP
2016
- NUMA 101 by Robert Hyatt, CCC, January 07, 2016 » NUMA
- Lazy SMP - how it works by Kalyankumar Ramaseshan, CCC, February 29, 2016 » Lazy SMP
- lockless hashing by Lucas Braesch, CCC, May 10, 2016
- What do you do with NUMA? by Matthew Lai, CCC, September 19, 2016 » NUMA
2017 ...
- Question about parallel search and race conditions by Michael Sherwin, CCC, September 11, 2017 » Parallel Search
- Prefetch and Threading by Dennis Sceviour, CCC, April 25, 2019 » Thread, Transposition Table
External Links
- Shared memory from Wikipedia
- Memory model from Wikipedia
- Information on the C++11 Memory Model by Scott Meyers, April 24, 2012
- Volatile variable from Wikipedia
- Memory ordering from Wikipedia
- Memory Ordering in Modern Microprocessors, Part I by Paul E. McKenney, Linux Journal, June 30, 2005
- Memory barrier from Wikipedia
- Parallel Random Access Machine from Wikipedia
- The Shared Memory Library (SharedMemoryLib) FAQ by Márcio Serolli Pinho
- Transactional memory from Wikipedia
- Software transactional memory from Wikipedia
- Cache coherence from Wikipedia
- Distributed shared memory from Wikipedia
- Memory-mapped file from Wikipedia
- Memory disambiguation from Wikipedia
- Memory dependence prediction from Wikipedia
- OpenMP from Wikipedia
- POSIX from Wikipedia
- shm_open, The Single UNIX Specification version 2, Copyright © 1997 The Open Group
- shmget(2): allocates shared memory segment - Linux man page » Linux
- mm(3): Shared Memory Allocation - Linux man page
- CreateSharedMemory, MSDN » Windows
- Chapter 9. Boost.Interprocess - Boost 1.36.0 by Ion Gaztañaga
- Threads and memory model for C++ by Hans J. Boehm
- IPC:Shared Memory by Dave Marshall, 1999
- Symmetric Multi-Processing (SMP) from Wikipedia » SMP
- Asymmetric multiprocessing from Wikipedia
- Uniform Memory Access from Wikipedia
- Non-Uniform Memory Access (NUMA) from Wikipedia » NUMA
- Optimizing Applications for NUMA | Intel® Developer Zone
- Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA Multiprocessor Systems (pdf)
Cache
- Cache from Wikipedia
- Cache (computing) from Wikipedia
- Functional Principles of Cache Memory by Paul V. Bolotoff, April 2007
- CPU cache from Wikipedia
- Cache-only memory architecture (COMA) from Wikipedia
- Cache coherence from Wikipedia
- False sharing from Wikipedia
- Cache coloring from Wikipedia
- Cache hierarchy from Wikipedia
- Cache-oblivious algorithm from Wikipedia
- Cache pollution from Wikipedia
- Cache prefetching from Wikipedia
- Prefetching from Wikipedia
- assembly - The prefetch instruction - Stack Overflow
- Data Prefetch Support - GNU Project - Free Software Foundation (FSF)
- Software prefetching considered harmful by Linus Torvalds, LWN.net, May 19, 2011
- Cache replacement policies from Wikipedia
- Page cache from Wikipedia
- Acumem SlowSpotter from Wikipedia
- Analyzing Efficiency of Shared and Dedicated L2 Cache in Modern Dual-Core Processors from iXBT Labs - Computer Hardware In Detail
- Scratchpad memory from Wikipedia
Concurrency and Synchronization
- I remember Edsger Dijkstra (1930 – 2002) « A Programmers Place by Maarten van Emden
- Concurrency from Wikipedia
- Category:Concurrency from Wikipedia
- Concurrency control from Wikipedia
- Optimistic concurrency control from Wikipedia
- Synchronization from Wikipedia
- Inter-process communication from Wikipedia
- Non-blocking algorithm from Wikipedia
- Linearizability from Wikipedia
- Monitor (synchronization)
- Lock from Wikipedia
- Busy waiting from Wikipedia
- Seqlock from Wikipedia
- Spinlock from Wikipedia
- Double-checked locking from Wikipedia
- Compare-and-swap from Wikipedia
- Test-and-set from Wikipedia
- Test and Test-and-set from Wikipedia
- Fetch-and-add from Wikipedia
- Barrier from Wikipedia
- Memory barrier from Wikipedia
- Critical section from Wikipedia
- Mutual exclusion from Wikipedia
- Semaphore from Wikipedia
- Transactional Synchronization Extensions from Wikipedia (Haswell)
- Readers-writers problem from Wikipedia
- Readers-writer lock from Wikipedia
- Read-copy-update from Wikipedia
- Producer-consumer problem from Wikipedia
- Dining philosophers problem from Wikipedia
- Cigarette smokers problem from Wikipedia
- Sleeping barber problem from Wikipedia
- Resource starvation from Wikipedia
- Deadlock from Wikipedia
- Anatomy of Linux synchronization methods by M. Tim Jones, IBM developerWorks, October 31, 2007
- The Little Book of Semaphores by Allen B. Downey
Distributed memory
- Koorde based on Chord and De Bruijn Sequence
Misc
- Cilk from Wikipedia » Cilk
- The Cilk Project from MIT
- Fetch-and-add from Wikipedia
- Intel Cilk Plus from Wikipedia
- XMTC from Wikipedia
- Ian Carr with Nucleus - Solar Plexus 1971, feat. Karl Jenkins and John Marshall, YouTube Video
References
- ↑ Dining philosophers problem from Wikipedia
- ↑ Jean-Christophe Weill (1996). The ABDADA Distributed Minimax Search Algorithm. Proceedings of the 1996 ACM Computer Science Conference, pp. 131-138. ACM, New York, N.Y, reprinted ICCA Journal, Vol. 19, No. 1, zipped postscript
- ↑ Lazy SMP, part 2 by Daniel Homan, CCC, January 12, 2013
- ↑ Lazy SMP in Cheng by Martin Sedlak, CCC, February 02, 2015
- ↑ Re: A new chess engine : m8 (comming not so soon) by Peter Österlund, CCC, February 01, 2015
- ↑ Memory disambiguation from Wikipedia
- ↑ Don Dailey, Charles E. Leiserson (2001). Using Cilk to Write Multiprocessor Chess Programs. Advances in Computer Games 9, pdf, 5 Other parallel programming issues, Cilk support for atomicity, pp. 17-18
- ↑ Robert Hyatt, Tim Mann (2002). A lock-less transposition table implementation for parallel search chess engines. ICGA Journal, Vol. 25, No. 1
- ↑ Re: Lockless hash: Thank you Bob Hyatt! by Robert Hyatt, CCC, August 26, 2013
- ↑ Robert Hyatt, Anthony Cozzie (2005). The Effect of Hash Signature Collisions in a Chess Program. ICGA Journal, Vol. 28, No. 3
- ↑ Re: lockless hashing by H.G.Muller, CCC, February 07, 2011
- ↑ Re: Effectively atomic read of 16 bytes on x86_64 without cmpxchg16b? by Anthony Williams, groups.google.lock-free, February 08, 2012
- ↑ Re: Speaking of the hash table by Ronald de Man, CCC, December 10, 2012
- ↑ SSE instructions: single memory access by Atom, Stack overflow, October 04, 2011
- ↑ Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3A, section 8.2.3.1.
- ↑ Fetch-and-add from Wikipedia
- ↑ shm_open, The Single UNIX Specification version 2, Copyright © 1997 The Open Group
- ↑ Maged Michael - Selected Publications
- ↑ The Aurora Distributed Shared Data System
- ↑ Re: scorpio can run on 8192 cores by Daniel Shawul, CCC, August 29, 2015
- ↑ Transposition-driven scheduling - Wikipedia
- ↑ Transposition driven scheduling by Daniel Shawul, CCC, April 04, 2013
- ↑ Information on the C++11 Memory Model by Scott Meyers, April 24, 2012
- ↑ John Romein, Henri Bal, Jonathan Schaeffer, Aske Plaat (2002). A Performance Analysis of Transposition-Table-Driven Scheduling in Distributed Search. IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 5, pp. 447–459. pdf