Computer Chess Club Archives



Subject: point by point

Author: Vincent Diepeveen

Date: 16:38:53 09/08/02



On September 08, 2002 at 19:17:31, Vincent Diepeveen wrote:

>On September 08, 2002 at 15:02:36, GuyHaworth wrote:
>
>>
>>I don't think Rob Hyatt and I are in disagreement about the parallelisability or
>>distributability of EGT generation.
>
>I can't imagine Bob saying that it's impossible to generate EGTBs at
>a cluster, so in that case i have to agree with you saying it's possible.
>
>Note that at supercomputers i/o speed = RAM speed usually.
>
>All you need is bandwidth of course and as we see above that's no
>real issue here.
>
>In case of multiprocessor machines it's even easier as you only have
>to start a few more threads each pass.
>
>Please consider that a pass of KRBP KQR is really slow compared to
>5 men. Each pass in itself you can already divide in several threads.
>
>The basic thing with 6 men is that they are very impractical to use
>in games. Suppose you have a harddisk with 150GB (or 2 terabyte in
>case of Nalimov's DTMs) of 6 men *all 6 men* with win/draw/loss. That's
>*right* now, pretty impractical with regard to the access. I can't
>afford in my pc a fast SCSI 150GB disk simply. Next year we will
>see of course what the price is of them by then, but right now it's
>just too expensive for me to buy it.
>
>That's the only reason i never generated them so far. Of course Nalimov
>by now has a small advantage that he has already all pawnless ones,
>but i wonder how he's going to store all the ones WITH pawns and what
>time it takes to generate them for him. I need 4 GB of RAM to generate
>the 6 men with pawns, so i basically waited for someone to offer a machine
>with 4 GB ram to start generate them.
>
>beutlerhvac did :)
>
>>Distributability:  an example.  There is nothing to stop computer A generating
>>KQQKQP while computer B generates KQQKRP.  All they require is that KQQKQQ/R/B/N
>>and KQQKRQ/R/B/N have been generated first.
>>
>>There is plenty of 'width' available in the lattice of endgames to allow
>>independent generation of EGTs.
>>
>>
>>Next, one can distinguish phases in the generation of an EGT:
>>
>>a)  the initialisation of the EGT ... terminal and transition positions
>>       this is a major piece of work and commonly underestimated

In fact, that is the slowest pass for me. My generator is not
slow, but even an IDE hard disk nowadays can write data faster than
the generator can initialize positions. In my old generator I used
Urban Koistinen's trick of generating 64 positions at the same time
(a modified implementation, same idea though), but it is not workable
for 6 men: you need too much RAM for it, otherwise access is no longer
sequential but hard-disk-latency bound, and that slows the generation
down by a factor of 1000 at least.
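The 64-at-a-time idea can be sketched roughly like this, assuming a plain win/loss bitmap where each bit of a 64-bit word is one position's flag (the names and the terminal-position callback are illustrative, not Koistinen's actual code):

```python
# Sketch: each bit of a 64-bit integer holds the win flag of one
# position, so one bitwise operation updates 64 positions at once.

def init_word(is_terminal_win, base_index):
    """Build one 64-bit word covering positions base_index..base_index+63."""
    word = 0
    for bit in range(64):
        if is_terminal_win(base_index + bit):
            word |= 1 << bit
    return word

def merge_words(a, b):
    """Merging two result bitmaps costs a single OR per 64 positions."""
    return a | b
```

The payoff is in the merge step: combining bitmaps later in the generation touches 64 positions per machine instruction.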

My generator can do several million positions at a time, but consider
this: a 6-men table with just 1 pawn and no other mirroring is about
18G positions.

That makes the initialization pass by far the slowest. 18G divided by
several million per second is several hours just to INITIALIZE and
CREATE both files. Note that file size and such is no problem as long
as you use NTFS (it goes up to 2^64 or something), and I was happy to
have used BITBOARD definitions everywhere already, so there are no
32-bit problems here. Of course there is a lot of slowdown because a
64-bit generator is hellishly slow on a 32-bit processor, but for
initialization that is no issue.
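The "several hours" figure follows from the numbers above; a quick back-of-the-envelope check, taking 4 million positions per second as an assumed stand-in for "several million":

```python
positions = 18 * 10**9   # ~18G positions for a 6-men table with one pawn
rate = 4 * 10**6         # assumed "several million" initializations/second
seconds = positions / rate
hours = seconds / 3600
print(hours)             # about 1.25 hours per file, before any disk I/O
```

Two files at that rate, plus writing them out, lands comfortably in the "several hours" range.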

Actually, the basic issue is the inverse of an index. The
Koistinen-style generator was heavily optimized for speed: I
maintained the board position incrementally, so creating the board
position for index i was really difficult, but it could not be done
faster.

What I do now is a very slow 'inverse index to position' function.
Hellishly slow, if I may use that expression here.
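A naive 'inverse index to position' function of the kind described might look like this, for a hypothetical index that simply packs six square numbers into base-64 digits (the encoding is illustrative, not DIEP's actual scheme, which also folds in mirroring):

```python
def index_to_squares(index, men=6):
    """Decode an index back into a list of square numbers (0..63).
    Slow compared to maintaining the position incrementally, because
    every position pays the full divide/modulo chain."""
    squares = []
    for _ in range(men):
        index, sq = divmod(index, 64)
        squares.append(sq)
    return squares

def squares_to_index(squares):
    """The forward mapping: index = sum(square_k * 64**k)."""
    index = 0
    for sq in reversed(squares):
        index = index * 64 + sq
    return index
```

Running this decode 18 billion times is exactly the cost that the incremental Koistinen-style approach avoided.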

Several hours is really very slow. Note that this is really simple to
do in parallel: the trick is to have 2 threads each generate 1 bitmap.


>>b)  [ optional phase but available with twin-sourcing:
>>      confirmation that the initialisation of the EGT-generation is correct ]

This is not needed at all; see below.

>>c)  retrograde or repeated-sweep generation of the EGT

What do you mean here? Do you mean the passes that make moves and read
the values from the other win/loss bitmaps, i.e. the passes that
generate mate in 1, 2, 3, 4, 5, 6...?
I'll assume so. That is less ideal to do in parallel compared to the
initialization pass, but it is definitely pretty easy to parallelize.

I hope you realize that generating the EGTBs in parallel is not a
real form of SMP, because after each pass sequential code gets
executed by the main thread. That step, however, is pretty easy:
combining a bitmap with the results you just obtained. The only speed
limitation is sequential hard-disk speed to generate, write, and read
the bitmaps.

In short, that is really fast.
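The per-pass structure described above, parallel back-up followed by a sequential combine on the main thread, can be sketched like this (a minimal sketch with Python threads; `backup_part` and `merge` are placeholders for the real work):

```python
import threading

def retrograde_pass(n, parts, backup_part, merge):
    """One 'mate in n' pass: workers back up frontier positions in
    their own slice of the index range; joining them acts as the
    barrier before the main thread runs the sequential combine step."""
    threads = [threading.Thread(target=backup_part, args=(n, part))
               for part in parts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()   # barrier: every part finished before merging
    merge(n)       # sequential code executed by the main thread
```

This is why it is not "real" SMP: between passes everything funnels through the single `merge(n)` call, so the speedup is bounded by how much of each pass the workers cover.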

>>d)  independent validation of the integrity of the EGT generated

That is a single extra pass after you first generate some CRCs, which
can be done with existing software. It is really simple to use
existing software here :)

Then one pass checks everything again. If that matches, and the CRC is
also the same, you're the hero.

It is a small part of the work.

>>
>>Within 'c', a further opportunity for parallelisation exists
>>
>>c1)  [ notional:  divide the range of the EGT-index into N parts ]
>>
>>c2)  have processor 'k' back-up the frontier positions in part k of the index
>>
>>Agreed:  you would not start the next retro-phase until all N processors
>>complete here.  Shared memory is needed for this, not to mention multiple
>>disc-drives.

Of course fast hard disks obviously help a lot; the old IDE disks are
definitely too slow here.

Yet you don't need shared memory at all for some of the passes;
bandwidth is all you need. That, however, only holds for clusters and
NUMA machines, which are not the same as shared memory.

Shared memory simply makes the programming quite a bit easier.

However, let's take the TERAS machine, which is a NUMA machine, and
stream the data from node 1 to node 2, *assuming* node 2 is on the
other side of the supercomputer's nodes, which is a kind of worst
case.

Bandwidth is about 1 terabyte a second. The receiving nodes (dual
processors) take it in at their hub I/O speed, which is 600MB/s
(memory is on a different hub and is also 600MB/s), so I hope you
realize that fetching the next cache line of 128 bytes is so fast
that you don't even notice it. A single bitmap is about 18G positions
divided by 8 bits per byte, so roughly 2.25GB, which sequentially
speaking is about 4 seconds to stream from node 1 to another node.

That's peanuts compared to the tens of minutes it takes to calculate
all the results.
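The 4-second figure checks out from the numbers above (one bit per position, I/O rate as stated):

```python
positions = 18 * 10**9        # ~18G positions in one bitmap
bitmap_bytes = positions / 8  # one bit per position -> 2.25 GB
io_rate = 600 * 10**6         # 600 MB/s hub I/O speed per node
print(bitmap_bytes / io_rate) # 3.75 seconds to stream one bitmap
```

So even in the worst-case node placement, transfer time is negligible next to the compute time of a pass.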

I hope you realize that.

So if bandwidth is big, it is no real issue. Of course, programming a
generator such that a cluster or NUMA machine can be used is quite a
bit more difficult. The NUMA machine obviously has the advantage here,
because you don't need to write the library functions to stream data
from place A to B at all; it happens *automatically*. The design just
has to be such that the nodes read and write at different parts of the
different files.

The shared-memory concept is quicker to implement here.

>>
>>In summary, EGT-generation is as parallelisable as required:
>>resource-availability is the only limit.
>>
>>g




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.