Computer Chess Club Archives

Subject: Re: Crafty profits little from Itanium and Opteron versus Commercials

Author: Vincent Diepeveen
Date: 20:30:36 08/07/03
On August 07, 2003 at 17:32:42, Gerd Isenberg wrote:

>On August 07, 2003 at 16:16:25, Sune Fischer wrote:
>
>>On August 07, 2003 at 15:40:33, Gerd Isenberg wrote:
>>
>>>On August 07, 2003 at 08:24:28, Sune Fischer wrote:
>>>
>>>>On August 07, 2003 at 08:15:08, Uri Blass wrote:
>>>>
>>>>>>Crafty is a 64-bit program, which means it's slow on 32-bit. Even I have
>>>>>>found that doing a lookup is faster than shifting; I simply never do
>>>>>>1<<sq, I use a table for that.
>>>>>
>>>>>I guess that this only holds for 64 bits, and that if you have a 32-bit
>>>>>number it is better to do 1<<i when 0<=i<32 and not to use arrays.
>>>>>
>>>>>Correct?
>>>>
>>>>If you can do the shift in 1 clock, then you can't go any faster, but 64 bit
>>>>shifts are slow on old 32 bit chips so the table becomes faster.
>>>>
>>>>So for pure 64 bit you get fewer tables, faster and cleaner code.
>>>>
>>>>-S.
>>>>
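The shift-versus-table trade-off discussed above can be sketched in C. This is only an illustration, not code from Crafty or any other engine in the thread; the names set_mask, init_set_mask, mask_by_shift and mask_by_table are made up for the example. Which variant wins depends on the target: on a 32-bit chip the 64-bit shift is typically compiled into several instructions, while the table costs one load plus cache space.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical lookup table: set_mask[sq] has only bit sq set. */
static uint64_t set_mask[64];

/* Fill the table once at program start. */
void init_set_mask(void) {
    for (int sq = 0; sq < 64; ++sq)
        set_mask[sq] = (uint64_t)1 << sq;
}

/* Variant A: compute the mask with a 64-bit shift.
   Note the cast: a plain 1 << sq is an int shift and is
   undefined for sq >= 31 on typical compilers. */
uint64_t mask_by_shift(int sq) {
    assert(sq >= 0 && sq < 64);
    return (uint64_t)1 << sq;
}

/* Variant B: fetch the precomputed mask with a single load. */
uint64_t mask_by_table(int sq) {
    assert(sq >= 0 && sq < 64);
    return set_mask[sq];
}
```

Both functions return the same value for every square, so an engine could hide the choice behind one macro and pick per platform after measuring.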
>>>
>>>Hi Sune,
>>>
>>>Exactly! On the other hand, i believe that there is no need to use 64 bits
>>>everywhere, if 32 bits are enough. Using the standard six 32-bit register set is
>>>still fine with Opteron and one byte shorter opcode due to missing REX prefix.
>>>
>>>I don't know what sizeof(int) is in AMD64 compilers: still 4, or 8 by default?
>>>But of course there are explicit 32- or 64-bit types, signed as well as
>>>unsigned.
>>>
>>>I'm curious what the fastest 64-bit bitscan on Opteron is, especially if two
>>>scans should be done simultaneously, e.g. to get a move from/to index:
>>>
>>>1. Matt Taylor's 64-bit mul with de Bruijn sequence.
>>>2. Folded 32-bit mul with Matt's super magic de Bruijn sequence.
>>>3. bsf, still vector path and 9 cycles.
>>>
>>>But i have to wait some time, until i can try it ;-(
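Option 1 in the list above rests on de Bruijn multiplication. A minimal C sketch follows, under stated assumptions: the constant is a commonly used 64-bit de Bruijn sequence and not necessarily the one Matt Taylor used, the lookup table is derived from the constant at startup instead of being hard-coded, and the names DEBRUIJN64, index64, init_index64 and bitscan_forward are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constant: a 64-bit de Bruijn sequence with six leading zero bits. */
static const uint64_t DEBRUIJN64 = 0x03f79d71b4cb0a89ULL;

/* index64[h] = bit position whose shifted sequence has top six bits h. */
static int index64[64];

/* Build the table: shifting the sequence left by i puts a unique
   6-bit window into the top bits, which identifies i. */
void init_index64(void) {
    for (int i = 0; i < 64; ++i)
        index64[(DEBRUIJN64 << i) >> 58] = i;
}

/* Index of the least significant set bit; bb must be non-zero. */
int bitscan_forward(uint64_t bb) {
    assert(bb != 0);
    uint64_t lsb = bb & (~bb + 1);            /* isolate the lowest set bit */
    return index64[(lsb * DEBRUIJN64) >> 58]; /* multiply acts as a shift   */
}
```

A from/to pair would be two independent calls, for example bitscan_forward(from_bb) and bitscan_forward(to_bb); since neither depends on the other, the two multiplies can overlap in the pipeline.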
>>
>>Well you're the expert, I just hope you post your findings here :)
>>
>>One thing I'm very interested in, is if floodfillers will be fast enough to
>>replace rotated. It would be nice if getting the bit wasn't needed, also to do
>>away with the incrementally updated occupied rotated boards.
>>What a "pure" code that would be :)
>>
>
>Not quite sure, Sune.
>
>So many promising options with opteron including rotated ;-)
>
>What about this approach?
>
>Kogge-Stone propagators in MMX, generators in sixteen 128-bit XMM, e.g.
>simultaneously for default material:
>black:white rook1,rook2,queen as rook but white:black king as rook meta slider,
>black:white bishops,queen as bishop but white:black king as bishop meta slider.
>
>Two opposite directions in parallel or interlaced. Pinned pieces or covered
>checkers on the fly with some and/ors. Very easy unconditional stuff, 5 up to 8
>or more independent MMX/SSE2 instructions in a row. There is only some const*
>source pointer (rsi) and target-structure/class pointer (rdi) for intermediate
>attack results for later eval and movegen/sorting SEE use. As well as disjoint
>direction attacks and disjoint piece attacks.
>
>Maybe "en passant" and out of order some gp-register processing to keep the
>pipes really busy, some easy pawn or knight stuff in C.
>
>Of course there are several incarnations of this routine, e.g. a more expensive
>but general one for all cases of unusual material with more than a queen per
>side, more than two rooks, bishops or knights, more than one bishop on same
>colored squares, and an even cheaper one e.g. for pawn/knight endings, where
>pins are not possible. One initial material dependent switch as the one and only
>condition here.
>
>Due to the amount of information, disjoint and aggregated output of these
>routines, a legal move generator based approach may outperform rotated,
>especially when double direct path SSE2 instructions become single in the
>future. This routine is even a nice place to put a prefetch instruction before.
>
>Even rotated i consider fast as hell on opteron with 32KByte lookup or less.
>
>Gerd
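For readers unfamiliar with the Kogge-Stone fills mentioned above, here is a minimal scalar sketch in plain C. It is not Gerd's MMX/SSE2 routine: it handles one piece set in 64-bit general registers, whereas the scheme described in the post runs several generators side by side in SIMD registers. The square mapping (a1 = bit 0, h1 = bit 7, a8 = bit 56) and all function names are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define NOT_A_FILE 0xfefefefefefefefeULL
#define NOT_H_FILE 0x7f7f7f7f7f7f7f7fULL

/* gen = generator (the sliders), pro = propagator (empty squares).
   Each fill floods gen through empty squares in three doubling steps. */
static uint64_t fill_north(uint64_t gen, uint64_t pro) {
    gen |= pro & (gen << 8);   pro &= pro << 8;
    gen |= pro & (gen << 16);  pro &= pro << 16;
    gen |= pro & (gen << 32);
    return gen;
}

static uint64_t fill_south(uint64_t gen, uint64_t pro) {
    gen |= pro & (gen >> 8);   pro &= pro >> 8;
    gen |= pro & (gen >> 16);  pro &= pro >> 16;
    gen |= pro & (gen >> 32);
    return gen;
}

static uint64_t fill_east(uint64_t gen, uint64_t pro) {
    pro &= NOT_A_FILE;         /* stop wrap-around from h-file to a-file */
    gen |= pro & (gen << 1);   pro &= pro << 1;
    gen |= pro & (gen << 2);   pro &= pro << 2;
    gen |= pro & (gen << 4);
    return gen;
}

static uint64_t fill_west(uint64_t gen, uint64_t pro) {
    pro &= NOT_H_FILE;         /* stop wrap-around from a-file to h-file */
    gen |= pro & (gen >> 1);   pro &= pro >> 1;
    gen |= pro & (gen >> 2);   pro &= pro >> 2;
    gen |= pro & (gen >> 4);
    return gen;
}

/* Attack set of all rooks at once: shift each fill one more step,
   so the first blocker in every direction is included. */
uint64_t rook_attacks(uint64_t rooks, uint64_t empty) {
    assert((rooks & empty) == 0);  /* rook squares are occupied */
    uint64_t n = fill_north(rooks, empty) << 8;
    uint64_t s = fill_south(rooks, empty) >> 8;
    uint64_t e = (fill_east(rooks, empty) << 1) & NOT_A_FILE;
    uint64_t w = (fill_west(rooks, empty) >> 1) & NOT_H_FILE;
    return n | s | e | w;
}
```

The code is branch-free and needs no lookup tables or bit extraction, which is what makes it a candidate for the wide, unconditional SIMD treatment described in the post. The four per-direction results are also available separately before the final OR, matching the "disjoint direction attacks" mentioned above.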

You can already calculate the minimum number of cycles you lose to get
1 move into a NORMAL register, and then add the store penalty to it.

Right now, on a K7 at 2.127 GHz, I'm getting a generating speed of 73 million
nodes a second after 1.e4 e5 2.d4 d5.

Most of that is overhead due to general arrays that work both for black and
white.

So how many cycles per node is that, realistically, for DIEP on the Opteron?

How many for your Kogge-Stone minimum?

Which one will be faster?

So why waste the effort on using bitboards?
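As a rough yardstick for the questions above, the K7 figures quoted in this post (2.127 GHz, 73 million nodes a second) can be turned into a cycle budget. The small C helper below is only a sketch with a hypothetical name; it says nothing about the Opteron, where the clock and the rate would both be different.

```c
#include <assert.h>

/* Average clock cycles spent per generated node. */
double cycles_per_node(double clock_hz, double nodes_per_second) {
    assert(nodes_per_second > 0.0);
    return clock_hz / nodes_per_second;
}
```

With the numbers above, cycles_per_node(2.127e9, 73e6) comes out at a little over 29 cycles per node, and that includes the array overhead mentioned in the post.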

>
>
>>-S.
>>
>>>Cheers,
>>>Gerd
>>>
>>>
>>>
>>>>>Uri
Last modified: Thu, 15 Apr 21 08:11:13 -0700