Parallel Search

Parallel scalability [1]

Parallel Search, also known as Multithreaded Search or SMP Search, is a way to increase search speed by using additional processors. This topic has been gaining popularity as multiprocessor computers have become widely available. Utilizing these additional processors is an interesting domain of research, since traversing a search tree is inherently serial. Several approaches have been devised, the most popular today being the Young Brothers Wait Concept and the Shared Hash Table, aka Lazy SMP.

This page gives a brief summary of the different types. SMP algorithms are classified by their scalability (the trend in search speed as the number of processors becomes large) and their speedup (the change in time to complete a search). Typically, programmers use scaling to mean the change in nodes-per-second (NPS) rate, and speedup to mean the change in time to depth. Scaling and scalability are thus two different concepts.
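As a hypothetical illustration: an engine searching 1.0 million nodes per second on one core and 3.6 million on four cores scales at 3.6x in NPS, but if a fixed-depth search takes 100 seconds on one core and 40 seconds on four, the speedup is only 2.5x; the difference is search overhead, i.e. extra or duplicated nodes visited by the parallel search.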

Distributed Search

A subtype of parallel algorithms, distributed algorithms are algorithms designed to work in cluster computing and distributed computing environments, where additional concerns beyond the scope of "classical" parallel algorithms need to be addressed.

Shared Hash Table

see Main page: Shared Hash Table

This technique is a very simple approach to SMP. The implementation requires little more than starting additional processors. Each processor is fed the root position at the beginning of the search, and each searches the same tree, with the transposition table as the only communication between them. The gains come from the effect of nondeterminism: each processor finishes the various subtrees in varying amounts of time, and as the search continues these effects grow, making the search trees diverge. The speedup then depends on how many nodes the main processor is able to skip thanks to transposition table entries. Many programs use this when a "quick and dirty" approach to SMP is needed. It has the reputation of yielding little speedup on a mere 2 processors, and of scaling quite badly after that.
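For concreteness, a minimal sketch of the idea in C++, assuming hypothetical Position and TTEntry types; a real engine packs score, depth, move and bound into the entry and guards against torn reads and writes:

// Minimal Shared Hash Table sketch: every thread is fed the same root
// position, and the transposition table is the only shared state.
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

struct TTEntry {
    std::atomic<uint64_t> key{0};    // Zobrist key of the stored position
    std::atomic<uint64_t> data{0};   // packed score/depth/move/bound
};

static std::vector<TTEntry> tt(1 << 20);   // shared by all threads

struct Position { /* board state, Zobrist key, ... */ };

void searchRoot(const Position& root, int threadId) {
    // An ordinary iterative-deepening alpha-beta would run here, probing
    // and storing tt entries; the threads never communicate any other way.
    (void)root; (void)threadId;
}

int main() {
    Position root;                           // fed to every processor at the start
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i)
        pool.emplace_back(searchRoot, std::cref(root), i);
    for (auto& t : pool) t.join();
}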

Lazy SMP

see Main page: Lazy SMP

Recent improvements by Daniel Homan [2], Martin Sedlak [3], and others on Lazy SMP indicate that the algorithm scales quite well up to 8 cores and beyond [4].

ABDADA

see Main page: ABDADA

ABDADA, Alpha-Bêta Distribué avec Droit d'Aînesse (Distributed Alpha-Beta Search with Eldest Son Right), is a loosely synchronized, distributed search algorithm by Jean-Christophe Weill [5]. It is based on the Shared Hash Table, and stores the number of processors currently searching a node inside its hash-table entry, so that work can be spread in the spirit of the Young Brothers Wait Concept.
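A hedged sketch of this bookkeeping, with illustrative names; the real algorithm keeps the counter alongside score, depth and bound in the shared hash entry:

#include <atomic>
#include <cstdint>

struct ABDADAEntry {
    std::atomic<uint64_t> key{0};
    std::atomic<int> nproc{0};   // processors inside this node right now
    // score, depth and bound type would be stored here as well
};

void enterNode(ABDADAEntry& e) { e.nproc.fetch_add(1, std::memory_order_relaxed); }
void leaveNode(ABDADAEntry& e) { e.nproc.fetch_sub(1, std::memory_order_relaxed); }

// On the first pass over the move list, a child already being searched by
// another processor is deferred and retried on a second pass, so "young
// brothers" are not searched redundantly while the eldest is unresolved.
bool shouldDefer(const ABDADAEntry& child, bool firstPass) {
    return firstPass && child.nproc.load(std::memory_order_relaxed) > 0;
}

int main() {
    ABDADAEntry e;
    enterNode(e);
    bool defer = shouldDefer(e, /*firstPass=*/true);  // true: node is busy
    leaveNode(e);
    return defer ? 0 : 1;
}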

Parallel Alpha-Beta

These algorithms divide the Alpha-Beta tree, giving different subtrees to different processors. Because alpha-beta is a serial algorithm, this approach is much more complex. However, these techniques also provide for the greatest gains from additional processors.

Principal Variation Splitting (PVS)

In Principal Variation Splitting (PVS), each node is represented by a thread, and a thread spawns one child thread for each legal move. The algorithm imposes a data dependency among these threads: only after the thread corresponding to the PV node returns a tighter bound may the remaining threads run.

PV Splitting [6]
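A minimal sketch of the idea in C++, using std::async in place of explicit per-node threads; Node, serialSearch and the toy scores are hypothetical stand-ins for a real engine's data structures:

#include <algorithm>
#include <cstdio>
#include <future>
#include <vector>

struct Node { int leafScore = 0; std::vector<Node> kids; };

// Ordinary serial alpha-beta (negamax form) used away from the PV.
int serialSearch(const Node& n, int alpha, int beta) {
    if (n.kids.empty()) return n.leafScore;
    int best = -32000;
    for (const Node& c : n.kids) {
        best = std::max(best, -serialSearch(c, -beta, -alpha));
        alpha = std::max(alpha, best);
        if (alpha >= beta) break;                 // beta cutoff
    }
    return best;
}

int pvSplit(const Node& n, int alpha, int beta) {
    if (n.kids.empty()) return n.leafScore;

    // Follow the PV: the leftmost child is itself searched with pvSplit,
    // satisfying the data dependency before any sibling starts.
    int best = -pvSplit(n.kids[0], -beta, -alpha);
    if (best >= beta) return best;
    alpha = std::max(alpha, best);

    // The younger siblings now run in parallel with the tightened bound.
    std::vector<std::future<int>> jobs;
    for (size_t i = 1; i < n.kids.size(); ++i)
        jobs.push_back(std::async(std::launch::async, [&n, i, alpha, beta] {
            return -serialSearch(n.kids[i], -beta, -alpha);
        }));
    for (auto& j : jobs) best = std::max(best, j.get());
    return best;
}

int main() {
    Node root;                                    // toy two-ply tree
    root.kids = { Node{3, {}}, Node{5, {}}, Node{2, {}} };
    std::printf("%d\n", pvSplit(root, -32000, 32000));
}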

YBWC and Jamboree

The idea in Feldmann's Young Brothers Wait Concept (YBWC) [7] [8], as well as in Kuszmaul's Jamboree Search [9] [10] [11], is to search the first sibling node before spawning the remaining siblings in parallel. This is based on the observation that the first move either produces a cutoff (in which case processing the sibling nodes would be wasted effort) or returns much better bounds. If the first move does not produce a cutoff, the remaining moves are searched in parallel. This process is recursive.

Since the number of processors is not infinite, "spawning" work normally consists of placing it on some kind of work-to-be-done stack, from which idle processors are free to grab work in FIFO fashion. In YBW, work is not spawned or placed on the stack until the first sibling has been searched.
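A toy sketch of such a work-to-be-done stack in C++; the queue, the worker loop and the job closures are illustrative assumptions, not any particular engine's implementation:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class WorkQueue {
    std::queue<std::function<void()>> jobs;   // the "work to be done"
    std::mutex m;
    std::condition_variable cv;
    bool closed = false;
public:
    void push(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(m); jobs.push(std::move(job)); }
        cv.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m); closed = true; }
        cv.notify_all();
    }
    // Idle processors block here and grab work in FIFO order.
    bool pop(std::function<void()>& job) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return closed || !jobs.empty(); });
        if (jobs.empty()) return false;
        job = std::move(jobs.front());
        jobs.pop();
        return true;
    }
};

void worker(WorkQueue& q) {
    std::function<void()> job;
    while (q.pop(job)) job();
}

int main() {
    WorkQueue q;
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i) pool.emplace_back(worker, std::ref(q));
    // In YBW, sibling subtrees are pushed only after the eldest brother
    // at their parent node has been searched.
    for (int s = 0; s < 8; ++s)
        q.push([s] { /* search sibling subtree s */ (void)s; });
    q.close();
    for (auto& t : pool) t.join();
}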

In their 1983 paper Improved Speedup Bounds for Parallel Alpha-Beta Search [12], Raphael Finkel and John Philip Fishburn gave theoretical confirmation to the common-sense wisdom that parallel resources should first be thrown into searching the first child. Assuming the tree is already in an approximation to best-first order, this establishes a good alpha value that can then be used to search the later children in parallel. The algorithm in the 1982 Artificial Intelligence paper [13], which Fishburn called the "dumb algorithm" in his 1981 thesis presentation [14], gives a p^0.5 speedup with p processors, while the 1983 PAMI algorithm (the "smart algorithm") gives a p^0.8 speedup for lookahead trees with the branching factor of chess.
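For p = 16 processors, for example, these bounds give a speedup of 16^0.5 = 4 for the dumb algorithm versus 16^0.8 ≈ 9.2 for the smart one.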

Dynamic Tree Splitting (DTS)

see Main page: Dynamic Tree Splitting

This algorithm, invented by the Cray Blitz team (including Robert Hyatt [15]), is the most complex. Though it offers the best known scalability of any SMP algorithm, very few programs use it because it is so difficult to implement.

Other Approaches

Many different approaches have been tried that do not directly split the search tree. These algorithms have not enjoyed popular success because they do not scale. Typical examples include one processor that evaluates positions fed to it by a searching processor, or a tactical search that confirms or refutes a positional search.

Taxonomy

Overview and taxonomy of parallel algorithms based on alpha-beta, given by Mark Brockington, ICCA Journal, Vol. 19, No. 3, 1996 [16]:

First Described | Algorithm | Author(s) | Processor Hierarchy / Control Distribution | Parallelism Possible at These Nodes | Synchronisation Done at These Nodes
--------------- | --------- | --------- | ------------------------------------------ | ------------------------------------ | ------------------------------------
1978 | Parallel Aspiration Search | Gérard M. Baudet | Static/Centralized | Root (αβ-window) | Root
1979 | Mandatory Work First | Selim Akl et al. | Static/Centralized | Type-1+3 + leftmost child of 3 | Bad Type-2
1980 | Tree Splitting | Raphael Finkel, John Philip Fishburn | Static/Centralized | Top k-ply | Root
1981 | PVS | Tony Marsland, Murray Campbell | Static/Centralized | Type-1 | Type-1
1983 | Key Node | Gary Lindstrom | Static/Centralized | Type-1+3 + leftmost child of 3 | Bad Type-2
1986 | UIDPABS [17] | Monroe Newborn | Static/Centralized | Root | None
1987 | DPVS | Jonathan Schaeffer | Dynamic/Centralized | Type-1+3+2 | Type-1+3+Bad 2
1987 | EPVS | Robert Hyatt, Bruce W. Suter, Harry Nelson | Dynamic/Centralized | Type-1+3 | Type-1+3
1987 | Waycool | Ed Felten, Steve Otto | Dynamic/Distributed | All, except Type-2 with no TT-Entry | Nodes with TT-Entry & no cutoff
1987 | YBWC | Rainer Feldmann | Dynamic/Distributed | Type-1+3+Bad 2 | Type-1+Bad 2
1988 | Dynamic Tree Splitting | Robert Hyatt | Dynamic/Distributed | Type-1+3+Bad 2 | Type-1+Bad 2
1988 | Bound-and-Branch | Chris Ferguson, Richard Korf | Dynamic/Distributed | Type-1+3+Bad 2 | Type-1+Bad 2
1990 | Delayed Branch Tree Expansion | Feng-hsiung Hsu | Static/Centralized | Type-1+3 | Bad Type-2
1993 | Frontier Splitting | Paul Lu | Dynamic/Distributed | All | Root
1993 | αβ* | Vincent David | Dynamic/Distributed | Type-1+3 | Type-1+3+Bad 2
1994 | CABP [18] | Van-Dat Cung | Static/Centralized | Type-1+3 | Bad Type-2
1994 | Jamboree | Bradley Kuszmaul | Dynamic/Distributed | Type-1+3+Bad 2 | Type-1+Bad 2
1995 | ABDADA | Jean-Christophe Weill | Dynamic/Distributed | Type-1+3+Bad 2 | Type-1+Bad 2
1995 | Dynamic Multiple PV-Split | Tony Marsland, Yaoqing Gao | Dynamic/Distributed | Nodes within PV set | Nodes within PV set
1996 | APHID | Mark Brockington, Jonathan Schaeffer | Static/Centralized | Top k-ply | None

Other Considerations

Memory Design

Semaphores

During a parallel search, certain areas of memory must be protected so that processors do not write to them simultaneously and corrupt the data. Some type of semaphore system must be used. A semaphore guards a piece of shared memory, typically an integer. When a processor wants to access the protected memory, it reads the integer. If the integer is zero, meaning no other process is accessing the memory, the processor attempts to make it nonzero. This whole sequence must be done atomically, meaning that the read, compare, and write happen as one indivisible operation; without this atomicity, another processor could read the integer at the same moment and both would conclude that they can freely access the memory.

Chess programs that use parallel alpha-beta usually rely on spinlocks. Because the semaphores are held for such short periods of time, a processor should waste as little time as possible between the release of a semaphore and its acquisition. To achieve this, a processor that finds the semaphore held simply reads it continuously until it is released. This technique can waste a lot of processing time in applications with high contention, where many processes try to acquire the same semaphore simultaneously; in chess programs, however, locks are held so briefly that contention is rare, and the fast acquisition outweighs the time spent spinning.

Spinlocks are sometimes implemented in assembly language, because the operating system does not always provide an API for them.
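Today a spinlock needs no assembly; a minimal sketch in portable C++ using std::atomic, with illustrative names:

// Test-and-set spinlock: exchange atomically reads the flag and sets it,
// returning the previous value; the loop spins (busy-waits) until it
// observes that the flag was previously clear.
#include <atomic>
#include <thread>
#include <vector>

class Spinlock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        while (locked.exchange(true, std::memory_order_acquire)) {
            // busy-wait: keep re-reading until the holder releases
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};

Spinlock lock_;
long counter = 0;                      // the protected shared memory

int main() {
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i)
        pool.emplace_back([] {
            for (int k = 0; k < 100000; ++k) {
                lock_.lock();
                ++counter;             // safe: only one thread at a time
                lock_.unlock();
            }
        });
    for (auto& t : pool) t.join();
    return counter == 400000 ? 0 : 1;  // all increments survive
}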

Threads vs. Processes

There are two ways of utilizing the extra processing power of multiple CPUs: threads and processes. The difference between them is that threads share all of the program's memory while providing multiple paths of execution, whereas in processes all memory is local to each process except memory that is explicitly shared. This means that in a threaded program, functions must pass around an extra argument specifying which thread is active, and thus which board structure to use. When using processes, a single global board can be used, because it is duplicated for each process.

Threads are more common, because they are easier to debug as well as to implement, provided the program does not already have lots of global variables. Processes are favored by some because the need to explicitly share memory makes subtle bugs easier to avoid. Also, with processes, the extra argument to most functions is not needed.
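A sketch of the threaded style, with hypothetical Board and SearchState types; the extra state argument replaces the global board that a process-based design would use:

#include <thread>
#include <vector>

struct Board { /* position data; never global in the threaded style */ };

struct SearchState {
    Board board;     // each thread gets a private copy of the root
    long nodes = 0;  // per-thread statistics
    int threadId = 0;
};

// The extra argument the text mentions: every search routine takes its
// thread's state explicitly, since all memory is shared between threads.
void searchRoot(SearchState* st) { st->nodes++; /* iterative deepening here */ }

int main() {
    const int nThreads = 4;
    std::vector<SearchState> states(nThreads);
    std::vector<std::thread> pool;
    for (int i = 0; i < nThreads; ++i) {
        states[i].threadId = i;
        pool.emplace_back(searchRoot, &states[i]);
    }
    for (auto& t : pool) t.join();
}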

Some programs that use threads:

Some programs that use processes:

Publications

1981

  • John Philip Fishburn (1981). Analysis of Speedup in Distributed Algorithms. Ph.D. Thesis, pdf
  • Selim Akl, R.J. Doran (1981). A comparison of parallel implementations of the alpha-beta and Scout tree search algorithms using the game of checkers, Department of Computing and Information Science, Queen's University, Kingston, Ontario.
  • Tony Marsland, Murray Campbell (1981). Parallel Search of Strongly Ordered Game Trees. Technical Report TR 81-9, Department of Computing Science , University of Alberta, pdf

2005 ...

  • Jan Renze Steenhuisen (2005). Transposition-Driven Scheduling in Parallel Two-Player State-Space Search. Masters Thesis, pdf

Forum Posts

2009

Results from UCT parallelization by Gian-Carlo Pascutto, CCC, March 11, 2009

2013

Measure of SMP scalability (sub-thread) by Ernest Bonnem, CCC, July 08, 2013

2015 ...

Explanation for non-expert? by Louis Zulli, CCC, February 16, 2015 » Stockfish
Best Stockfish NPS scaling yet by Louis Zulli, CCC, March 02, 2015

2016

Re: Baffling multithreading scaling behavior by Robert Hyatt, CCC, September 07, 2016

2017

Approximate ABDADA by Peter Österlund, CCC, August 23, 2017 » ABDADA

External Links

Parallel Search

Parallel Computing

Scalability

Shared Memory

Shared Memory:

False sharing from Wikipedia

Cache

Cache:

MSI protocol from Wikipedia
MESI protocol from Wikipedia
MOESI protocol from Wikipedia
assembly - The prefetch instruction - Stack Overflow
Data Prefetch Support - GNU Project - Free Software Foundation (FSF)
Software prefetching considered harmful by Linus Torvalds, LWN.net, May 19, 2011

Concurrency and Synchronization

Synchronization:

Cooperating sequential processes (EWD 123)
A challenge to memory designers? (EWD 497)

References

  1. Super Linearity and the Bigger Machine from Dr. Dobb's
  2. Lazy SMP, part 2 by Daniel Homan, CCC, January 12, 2013
  3. Lazy SMP in Cheng by Martin Sedlak, CCC, February 02, 2015
  4. Re: A new chess engine : m8 (comming not so soon) by Peter Österlund, CCC, February 01, 2015
  5. Jean-Christophe Weill (1996). The ABDADA Distributed Minimax Search Algorithm. Proceedings of the 1996 ACM Computer Science Conference, pp. 131-138. ACM, New York, N.Y., reprinted ICCA Journal, Vol. 19, No. 1, zipped postscript
  6. Yaoqing Gao, Tony Marsland (1996). Multithreaded Pruned Tree Search in Distributed Systems. Journal of Computing and Information, 2(1), 482-492, pdf
  7. Rainer Feldmann, Peter Mysliwietz, Burkhard Monien (1991). A Fully Distributed Chess Program. Advances in Computer Chess 6, pdf
  8. Rainer Feldmann (1993). Game Tree Search on Massively Parallel Systems. Ph.D. Thesis, pdf
  9. Christopher F. Joerg, Bradley C. Kuszmaul (1994). Massively Parallel Chess, pdf
  10. Bradley C. Kuszmaul (1994). Synchronized MIMD Computing. Ph. D. Thesis, Massachusetts Institute of Technology, pdf
  11. Bradley C. Kuszmaul (1995). The StarTech Massively Parallel Chess Program. pdf
  12. Raphael Finkel, John Philip Fishburn (1983). Improved Speedup Bounds for Parallel Alpha-Beta Search. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 5, No. 1, pp. 89 - 92
  13. Raphael Finkel, John Philip Fishburn (1982). Parallelism in Alpha-Beta Search. Artificial Intelligence, Vol. 19, No. 1
  14. John Philip Fishburn (1981). Analysis of Speedup in Distributed Algorithms. Ph.D. Thesis, pdf
  15. Robert Hyatt (1994). The DTS high-performance parallel tree search algorithm
  16. Mark Brockington (1996). A Taxonomy of Parallel Game-Tree Search Algorithms. ICCA Journal, Vol. 19, No. 3
  17. UIDPABS = Unsynchronized Iteratively Deepening Parallel Alpha-Beta Search
  18. CABP = Concurrent Alpha-Beta Pruning
  19. It should also be noted that Crafty uses threads on Windows, and used processes on Unix
  20. threads vs processes again by Robert Hyatt, CCC, February 27, 2006
  21. Jonathan Schaeffer (1985). Lionel Moser: An Experiment in Distributed Game Tree Searching. ICCA Journal, Vol. 8, No. 2 (Review)
  22. An Introduction to Computer Chess by Alejandro López-Ortiz, 1993
  23. Nagging from Wikipedia
  24. Re: scorpio can run on 8192 cores by Daniel Shawul, CCC, August 29, 2015
  25. Transposition-driven scheduling - Wikipedia
  26. Transposition driven scheduling by Daniel Shawul, CCC, April 04, 2013
  27. Dovetailing (computer science) from Wikipedia
  28. Georg Hager's Blog | Random thoughts on High Performance Computing
  29. Re: Minmax backup operator for MCTS by Brahim Hamadicharef, CCC, December 30, 2017
  30. Message Passing Interface from Wikipedia
  31. John Romein, Henri Bal, Jonathan Schaeffer, Aske Plaat (2002). A Performance Analysis of Transposition-Table-Driven Scheduling in Distributed Search. IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 5, pp. 447–459. pdf
  32. Message Passing Interface from Wikipedia
  33. std::async - cppreference.com
  34. Parallel Search by Tom Kerrigan
  35. "How To" guide to parallel-izing an engine by Tom Kerrigan, CCC, August 27, 2017
