Author: Vincent Diepeveen
Date: 16:38:53 09/08/02
Go up one level in this thread
On September 08, 2002 at 19:17:31, Vincent Diepeveen wrote: >On September 08, 2002 at 15:02:36, GuyHaworth wrote: > >> >>I don't think Rob Hyatt and I are in disagreement about the parallelisability or >>distributability of EGT generation. > >I can't imagine Bob saying that it's impossible to generate EGTBs at >a cluster, so in that case i have to agree with you saying it's possible. > >Note that at supercomputers i/o speed = RAM speed usually. > >All you need is bandwidth of course and as we see above that's no >real issue here. > >In case of multiprocessor machines it's even easier as you only have >to start a few more threads each pass. > >Please consider that a pass of KRBP KQR is really slow compared to >5 men. Each pass in itself you can already divide in several threads. > >The basic thing with 6 men is that they are very impracticle to use >in games. Suppose you have a harddisk with 150GB (or 2 terabyte in >case of Nalimov's DTMs) of 6 men *all 6 men* with win/draw/loss. That's >*right* now, pretty impracticle with regard to the access. I can't >afford in my pc a fast SCSI 150GB disk simply. Next year we will >see of course what the price is of them by then, but right now it's >just too expensive for me to buy it. > >That's the only reason i never generated them so far. Of course Nalimov >by now has a small advantage that he has already all pawnless ones, >but i wonder how he's going to store all the ones WITH pawns and what >time it takes to generate them for him. I need 4 GB of RAM to generate >the 6 men with pawns, so i basically waited for someone to offer a machine >with 4 GB ram to start generate them. > >beutlerhvac did :) > >>Distributability: an example. There is nothing to stop computer A generating >>KQQKQP while computer B generates KQQKRP. All they require is that KQQKQQ/R/B/N >>and KQQKRQ/R/B/N have been generated first. >> >>There is plenty of 'width' available in the lattice of endgames to allow >>independent generation of EGTs. >> >> >>Next, one can distinguish phases in the generation of an EGT: >> >>a) the initialisation of the EGT ... terminal and transition positions >> this is a major piece of work and commonly underestimated In fact that's the slowest pass for me indeed. My generator is not slow but harddisk speed is by far faster even at an IDE nowadays than the generator can initialize positions. I used in my old generator the trick to generate 64 positions at the same time by Urban Koistinen (modified implementation, same idea though) but it is not workable for 6 men as you need too much RAM for it otherwise things are no longer sequential but harddisk latency based and that slows down by a factor 1000 at least the generation. My generator can do several millions at a time, but consider this, a 6 men with just 1 pawn and no other mirrorring it's like 18G positions. That makes the initialization pass by far slowest. 18G / several million is several hours just to INITIALIZE and CREATE both files. Note that file size and such is no problem as long as you use NTFS (goes to 2^64 or something) and i was so happy to use BITBOARD definition everywhere already so no 32 bits problems here. Of course there is a lot slowdown because at a 32 bits processor 64 bits generator is hell slow, for initialization that's no issue though. Actual the basic issue is the inverse of an index. In the koistinen style generator i was really optimized for speed. I incremental had the board position. So creating the board position for index i was real difficult but not possible to get faster. What i do now is a very slow 'inverse index to position' function. Hell slow if i may use that expression here. Several hours is really f'ing slow. Note this real simple to do in parallel. the trick is to have 2 threads generate each 1 bitmap. >>b) [ optional phase but available with twin-sourcing: >> confirmation that the initialisation of the EGT-generation is correct ] This is not needed at all see later. >>c) retrograde or repeated-sweep generation of the EGT What do you mean here? You mean the passes that make moves and get the value from the other win/loss bitmaps, so basically the passes that generate mate in 1, 2, 3, 4 ,5 , 6... ? I'll assume so. That's less ideal to do in parallel when compared to pass b, but it's definitely pretty easy to do in parallel. I hope you realize that generating the egtbs in parallel is not a real form of SMP, because after each pass sequential code gets executed by the mainthread. That's however pretty easy combining a bitmap with the results you just had. Only speed limitation is sequential harddisk speed to generate and write and read bitmaps. That's real fast in short. >>d) independent validation of the integrity of the EGT generated that's a single extra pass after you first generated some CRCs which gets done with existing software. It's real simple to use existing software here :) then 1 pass that checks everything again. If that matches you're the hero if the crc is also the same. A small part of the work it is. >> >>Within 'c', a further opportunity for parallelisation exists >> >>c1) [ notional: divide the range of the EGT-index into N parts ] >> >>c2) have processor 'k' back-up the frontier positions in part k of the index >> >>Agreed: you would not start the next retro-phase until all N processors >>complete here. Shared memory is needed for this, not to mention multiple >>disc-drives. Of course fast harddisks help a lot obviously, the old IDE disks definitely are too slow here. Yet you don't need shared memory at all for some of the passes. Just bandwidth is what you need. Nevertheless that's only at clusters and NUMA machines not the same like shared memory. Shared memory makes programming simply quite a bit easier. However let's take the TERAS machine which is a NUMA machine. If it stream the data from node 1 to node 2, and then we *assume* it's the other side of the supercomputers nodes, which is a kind of worst case thing. Bandwidth is about 1 terabyte a second. Receiving nodes (dual processors) do it at their hub i/o speed which is 600MB/s (memory is a different hub and is also 600MB/s) so i hope you realize that getting next cache line of 128 bytes is that fast that you don't notice it even. A single bitmap is about 18G positoins divided by 8 bits is a bit more than 2.5GB or that's about sequential spoken 4 seconds and a very tiny bit to stream from node 1 to another node. That's peanuts compared to the tens of minutes to calculate all the results. I hope you realize that. So if bandwidth is big, it is no real issue. Of coruse programming a generator such that a cluster or NUMA machine can get used it is quite more difficult, obviously the NUMA has the advantage here because you don't need to write down the library functions at all to stream data from place a to b, it goes *automatically*. Just the design must be different so that the nodes are reading and writing at different parts in the different files. the shared memory concept here is quicker to implement. >> >>In summary, EGT-generation is as parallelisable as required: >>resource-availability is the only limit. >> >>g
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.