Computer Chess Club Archives

Subject: Re: Crafty profits little from Itanium and Opteron versus Commercials

Author: Gerd Isenberg
Date: 00:00:01 08/08/03
On August 07, 2003 at 23:30:36, Vincent Diepeveen wrote:

>On August 07, 2003 at 17:32:42, Gerd Isenberg wrote:
>
>>On August 07, 2003 at 16:16:25, Sune Fischer wrote:
>>
>>>On August 07, 2003 at 15:40:33, Gerd Isenberg wrote:
>>>
>>>>On August 07, 2003 at 08:24:28, Sune Fischer wrote:
>>>>
>>>>>On August 07, 2003 at 08:15:08, Uri Blass wrote:
>>>>>
>>>>>>>Crafty is a 64-bit program, which means it's slow on 32-bit hardware. Even I
>>>>>>>have found that doing a lookup is faster than shifting: I simply never do
>>>>>>>1<<sq, I use a table for that.
>>>>>>
>>>>>>I guess that this holds only for 64 bits, and if you have a 32-bit number then
>>>>>>it is better to do 1<<i when 0<=i<32 and not to use arrays.
>>>>>>
>>>>>>Correct?
>>>>>
>>>>>If you can do the shift in 1 clock, then you can't go any faster, but 64 bit
>>>>>shifts are slow on old 32 bit chips so the table becomes faster.
>>>>>
>>>>>So for pure 64 bit you get fewer tables, faster and cleaner code.
>>>>>
>>>>>-S.
>>>>>
>>>>
>>>>Hi Sune,
>>>>
>>>>Exactly! On the other hand, I believe that there is no need to use 64 bits
>>>>everywhere if 32 bits are enough. Using the standard six 32-bit register set is
>>>>still fine with Opteron, and the opcode is one byte shorter due to the missing
>>>>REX prefix.
>>>>
>>>>I don't know what sizeof(int) is in AMD64 compilers: still 4, or 8 by default?
>>>>But of course there are explicit 32- or 64-bit types, signed as well as
>>>>unsigned.
>>>>
>>>>I'm curious what the fastest 64-bit bitscan on Opteron is, especially if two
>>>>scans should be done simultaneously, e.g. to get a move from/to index:
>>>>
>>>>1. Matt Taylor's 64-bit mul with de Bruijn sequence.
>>>>2. Folded 32-bit mul with Matt's super magic de Bruijn sequence.
>>>>3. bsf, still vector path and 9 cycles.
>>>>
>>>>But I have to wait some time until I can try it ;-(
>>>
>>>Well you're the expert, I just hope you post your findings here :)
>>>
>>>One thing I'm very interested in is whether floodfillers will be fast enough to
>>>replace rotated. It would be nice if getting the bit wasn't needed, and also to
>>>do away with the incrementally updated occupied rotated boards.
>>>What "pure" code that would be :)
>>>
>>
>>Not quite sure, Sune.
>>
>>So many promising options with Opteron, including rotated ;-)
>>
>>What about this approach?
>>
>>Kogge-Stone propagators in MMX, generators in sixteen 128-bit XMM, e.g.
>>simultaneously for default material:
>>black:white rook1,rook2,queen as rook but white:black king as rook meta slider,
>>black:white bishops,queen as bishop but white:black king as bishop meta slider.
>>
>>Two opposite directions in parallel or interlaced. Pinned pieces or covered
>>checkers on the fly with some and/ors. Very easy unconditional stuff, 5 up to 8
>>or more independent MMX/SSE2 instructions in a row. There is only some const*
>>source pointer (rsi) and a target structure/class pointer (rdi) for intermediate
>>attack results for later eval and movegen/sorting SEE use. As well as disjoint
>>direction attacks and disjoint piece attacks.
>>
>>Maybe "en passant" and out of order some gp-register processing to keep the
>>pipes really busy, some easy pawn or knight stuff in C.
>>
>>Of course there are several incarnations of this routine, e.g. a more expensive
>>but general one for all cases of unusual material with more than a queen per
>>side, more than two rooks, bishops or knights, or more than one bishop on
>>same-colored squares, and an even cheaper one e.g. for pawn/knight endings, where
>>pins are not possible. One initial material-dependent switch is the one and only
>>condition here.
>>
>>Due to the amount of information, the disjoint and aggregated output of these
>>routines, a legal move generator based approach may outperform rotated,
>>especially when double direct path SSE2 instructions become single in the future.
>>This routine is even a nice place to put a prefetch instruction before.
>>
>>Even rotated I consider fast as hell on Opteron with 32 KByte lookup or less.
>>
>>Gerd
>
>you can already calculate a minimum number of cycles you lose to get
>in a NORMAL register 1 move and then add the store penalty to it.
>

Do you mean mov mmx/xmm <-> reg64?
Yes, movd is still vector path and should be avoided.

>Now at a K7 2.127Ghz i'm going at 73MLN nodes a second generating speed
>after 1.e4,e5 2.d4,d5
>
>most of that is overhead due to general arrays that work both for black and
>white.
>
>So how many cycles is that realistically for DIEP at the opteron a node?
>
>How many for your kogge stone minimum.
>
>Which one will be faster?

We will see. Kogge-Stone has some potential to do things massively parallel,
for several piece sets as well as for several directions.

For pure movegen I would of course use a completely different design.
But as we all know, that's not the main task.
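
For readers who have not seen it, here is a minimal sketch of the Kogge-Stone occluded fill in plain C, not the MMX/SSE2 version discussed above. It assumes a1 = bit 0 and h8 = bit 63; the function names and the U64 typedef are only illustrative. The generator is the slider set, the propagator is the set of empty squares, and three unconditional steps per direction replace the loop over up to seven squares.

```c
#include <stdint.h>

typedef uint64_t U64;

/* Assumption: a1 = bit 0, h1 = bit 7, h8 = bit 63. */
#define NOT_A_FILE 0xfefefefefefefefeULL

/* Fill north from the generators through the propagators (empty squares). */
U64 north_occluded_fill(U64 gen, U64 pro) {
    gen |= pro & (gen << 8);
    pro &= pro << 8;
    gen |= pro & (gen << 16);
    pro &= pro << 16;
    gen |= pro & (gen << 32);
    return gen;
}

/* Fill east; masking the A-file stops wraps from the h-file to the next rank. */
U64 east_occluded_fill(U64 gen, U64 pro) {
    pro &= NOT_A_FILE;
    gen |= pro & (gen << 1);
    pro &= pro << 1;
    gen |= pro & (gen << 2);
    pro &= pro << 2;
    gen |= pro & (gen << 4);
    return gen;
}

/* Attack sets: shift the fill one more step so the first blocker is included. */
U64 north_attacks(U64 sliders, U64 empty) {
    return north_occluded_fill(sliders, empty) << 8;
}

U64 east_attacks(U64 sliders, U64 empty) {
    return (east_occluded_fill(sliders, empty) << 1) & NOT_A_FILE;
}
```

Since there are no branches and no lookups, several generator sets can run through the same steps side by side, which is the property the SIMD idea relies on.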

>
>So why waste the effort to using bitboards?
>

Didn't we have this discussion before?
It is well known that this "natural" board representation doesn't fit your
thinking patterns ;-)

Regards,
Gerd
Last modified: Thu, 15 Apr 21 08:11:13 -0700