Author: Vincent Diepeveen
Date: 04:34:00 05/31/01
On May 30, 2001 at 10:40:57, Robert Hyatt wrote:
>>Also gnuchess treats pieces generically, whereas in crafty you have
>>written out code for *every* piece.
>
>
>What does this have to do with anything? I simply wanted to know "How much
>more does crafty get from a 64 bit machine than a 32-bit program?" This
If that machine, like a SUN, does not give much penalty for
branch mispredictions, then my NPS will get enormously bigger,
whereas for crafty it does not matter so much.
The only 2 successful cpu's that are 64 bits are IMHO:
- alpha
- sun
On the first you are lucky that you have way fewer branches (or
more fall-through branches) than i have.
The alpha happens to be not only a machine with the right instructions
for you, it also has HUGE branch misprediction penalties.
So that slows me down.
But let's take the SUN or MERCED.
Both on paper have very small misprediction penalties for branches.
Merced on paper has none.
How would crafty do on merced compared to DIEP, if both are compiled
natively for 64 bits (so all ints also go to 64 bits, because the
32 bits execution units of merced suck bigtime, 3 times slower
than a P3 at the same speed)?
>experiment answered that perfectly for the case of GNU vs Crafty. I doubt
>your speed differential will be any different on the two machines than that
>of GNU.
Diep has a much bigger code size than gnuchess, so in my case i always
profit from cpu's with bigger L1 caches and/or fast L2 caches.
More than any other program in fact.
Note that many datastructures in gnuchess are 8 bits, which is hell slow
on alpha. I have nothing in 8 bits.
Though that's slower on 32 bits cpu's, i have chosen to go
completely 32 bits so that i no longer suffer from casting bugs of the
compiler or any programming mistake which would trigger a casting bug.
So gnuchess is not a good comparison in fact.
But diep definitely suffers from the bigger penalties that the alpha cpu
has for branch misprediction, just as gnuchess suffers from them.
>
>
>>
>>So the comparison is impossible. You can design much faster loops and
>>arrays than gnuchess has, and if one does it in assembly it's even
>>faster.
>
>
>There was no assembly in the alpha version of crafty, so your comment makes
>utterly no sense. Whether GNU is a "good" 32 bit program or not is also
>completely irrelevant. The question was how much faster would it run on the
>alpha? The answer was given above. Ditto for crafty, which _is_ designed
>around a 64 bit word. I'll be happy to compile Diep and compare it to crafty
>on my alpha if you want.
There happens to be one cpu that does not favour all my branches,
but favours crafty BECAUSE your evaluation is so simplistic.
However that's at the same time the weakness of crafty, so you
present your weakness here as being your strong point?
My weak point is very clear from a programming viewpoint:
loads of branches that get mispredicted. Also the pathetic
2048 entries in the alpha processor (note PII had 512 and K7 has 2048)
are no match for 460kb of CODE size.
>The numbers won't be any different.
>>>That is pretty significant. Now whether someone can beat my move generator
>>
>>That's comparing a bicycle which is made to show the advantage of the
>>wheel with a propeller airplane.
>>
>>I would rather compare the propeller airplane with a jet engine.
>>
>>>with an 0x88 (or something better) is a question. Whether they can beat my
>>>evaluator using 0x88 is another question. But it is pretty obvious that on a
>>>PC they had better beat me by a factor of two, or I will catch up on the
>>
>>2.2 to be exact.
>>
>>>64 bit machines. And then some of the bitboard wizardry comes in handy. such
>>
>>Yes i'll beat you on a SUN too, no problem.
>
>
>Anytime you want me to run the test, feel free to send me your code. We have
>64 bit sparcs here, I have an alpha. I can probably run the comparison on a
>64 bit MIPS box too. I already know the answer, however.
If i can log on remotely i can do it myself. I'm not happy to give away
my source; even though i know you wouldn't misuse it, i distrust the
internet.
I know how to compile diep for SUN and alpha :)
>
>A 64 bit program simply runs faster when ported to a 64 bit machine, than a
>32 bit program does when ported to the same machine. _every_ time.
But you first slow it down 3 times or so in order to get 2 times faster
on a 64 bits machine.
It's like the Rainer Feldmann speedups at 512 processors. First he
slows down his program by a factor of 10 to get a 50% speedup, because he
compares with the slowed down version, instead of reporting a 10% speedup
when comparing to a non-slowed down version.
The problem here is that by revealing source code i lose my code!
But let's do an easy example, because it's no problem to compare
the assembly which the compiler generates here.
I'll do that in a new thread called:
'evaluation and bitboards'
I'll first download the latest crafty
source. I only have 18.1 or so here.
>
>
>>
>>But let me ask you how you plan to do mobility on an alpha,
>>instead of the crude summation you're doing now!
>
>
>
>_SAME_ code as right now.
>
>
>>
>>And how do you plan to use attack table information everywhere in
>>the evaluation, using the slow attack function that crafty
>>currently uses!
>
>
>First, my attack tables are not slow. They are simple direct memory lookups.
>I use exactly the same code on any architecture.
>
>>
>>Whereas in my attack tables i can directly see how many attackers are
>>on a square with one AND instruction between an array entry within
>>L1 data cache and a constant value:
>> (MyAtt[sq]&0x0ff)
>>
>>I'm using that loads of times in my evaluation. Also whether some
>>square is attacked anyway by my opponent:
>>
>> OpAtt[sq]
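
A minimal sketch of the kind of lookup quoted above, for readers who want to see it compile. Only the names MyAtt and OpAtt and the mask 0x0ff come from the post. The 32-bit element type, the meaning of the upper bits and the helper functions are assumptions for illustration, not DIEP's actual layout.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: the low 8 bits of each entry hold the number of
   attackers on that square; what the upper bits hold is not stated
   in the post, so they are left unused here. */
enum { ATT_COUNT_MASK = 0x0ff };

static uint32_t MyAtt[64];   /* squares attacked by the side to move */
static uint32_t OpAtt[64];   /* squares attacked by the opponent     */

/* Hypothetical helper: record one more attacker on sq. */
static void add_attacker(uint32_t att[64], int sq)
{
    assert(sq >= 0 && sq < 64);
    att[sq] += 1;
}

/* One array load plus one AND with a constant, as in (MyAtt[sq]&0x0ff). */
static int attacker_count(const uint32_t att[64], int sq)
{
    return (int)(att[sq] & ATT_COUNT_MASK);
}

/* "Is this square attacked at all?" is just a test for non-zero. */
static int is_attacked(const uint32_t att[64], int sq)
{
    return att[sq] != 0;
}
```

The cost of such a scheme is in keeping the tables up to date, which this sketch does not attempt to show.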
>
>
>Vincent, I know you have a short attention span. But please stick to the
>topic. The issue here is "how does a 32 bit program perform on a 32 bit
>processor, and then when moved to a 64 bit processor, compared to how does a 64
>bit program perform on a 32 bit processor and then on a 64 bit processor?"
>
>You want to change the question. You can't. I will happily run my program and
>your program on both a 32 bit machine and on a 64 bit machine and compare the
>speedups. Mine will be larger. Every time.
>
>
>
>
>>
>>That would be pretty interesting to get fast in bitboards too, but
>>i can already tell you, IT'S IMPOSSIBLE to do it quick!
>
>
>This from the same person that said "using message passing is impossible to
>get a speedup." Or "It is impossible to win at chess with a quiescence search
>that only does limited captures." Or any of several other "impossible" things.
>Later, you learn how to do them of course.
>
>
>
>
>
>>
>>What bitboards are good at are a few things which hardly get used,
>>like some complex pawn structures:
>> if( (MyPawn&0x0a00000010001000) == 0x0a00000010001000 )
>>
>>So you can detect for a certain side at the same time several pawns,
>>which otherwise is slower:
>> if( quickboard[sq_a2] == whitepawn
>> && quickboard[sq_b2] == whitepawn
>> && quickboard[sq_c2] == whitepawn )
>>
>>However how many programs are there except mine that use loads
>>of complex pawnstructure in evaluation?
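
The two tests quoted above can be put side by side in a small sketch. The square numbering (a1 = 0 up to h8 = 63), the piece codes and the function names are assumptions for illustration. Only the idea of one masked compare versus three array compares comes from the post.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encodings, not taken from crafty or DIEP. */
enum { EMPTY = 0, WHITEPAWN = 1 };
enum { SQ_A2 = 8, SQ_B2 = 9, SQ_C2 = 10 };   /* a1 = 0 ... h8 = 63 */

#define BIT(sq) (UINT64_C(1) << (sq))

/* Bitboard version: one AND and one compare test all three pawns. */
static int pawns_abc2_bitboard(uint64_t white_pawns)
{
    const uint64_t mask = BIT(SQ_A2) | BIT(SQ_B2) | BIT(SQ_C2);
    return (white_pawns & mask) == mask;
}

/* Array version: three loads and up to three branches. */
static int pawns_abc2_mailbox(const int quickboard[64])
{
    return quickboard[SQ_A2] == WHITEPAWN
        && quickboard[SQ_B2] == WHITEPAWN
        && quickboard[SQ_C2] == WHITEPAWN;
}
```

On a 32 bits cpu the 64-bit AND and compare are each split over two registers, which is part of what the argument in this thread is about.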
>
>Mine, for example. I even compute all the squares all pawns can move to,
>and use that in the evaluation.
>
>
>
>
>>
>>Now you'll say that you can do other things quick too, like detecting
>>whether a file is empty and such things. However there are good
>>alternatives in 32 bits that can be seen as a bitboard too, which
>>i happen to use in DIEP.
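
The post does not say what the 32 bits alternatives are. One common possibility, shown here purely as an assumption, is a per-side mask with one bit per file that contains a pawn, which fits easily in a 32-bit word. All names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Build a file mask from per-file pawn counts: bit f is set when the
   side has at least one pawn on file f (0 = a-file ... 7 = h-file). */
static uint32_t pawn_files_from_counts(const int pawns_on_file[8])
{
    uint32_t files = 0;
    for (int f = 0; f < 8; f++) {
        if (pawns_on_file[f] > 0)
            files |= 1u << f;
    }
    return files;
}

/* A file is open when neither side has a pawn on it. */
static int file_is_open(uint32_t white_files, uint32_t black_files, int file)
{
    assert(file >= 0 && file < 8);
    return (((white_files | black_files) >> file) & 1u) == 0;
}

/* True when the given side has no pawn of its own on the file. */
static int no_own_pawn_on_file(uint32_t own_files, int file)
{
    assert(file >= 0 && file < 8);
    return ((own_files >> file) & 1u) == 0;
}
```

All operations here stay within one 32-bit register, so nothing is split in two on a 32 bits cpu.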
>
>
>
>There are other "good" alternatives. But you are _still_ using 1/2 of an
>alpha. _that_ is the point.
>
>
>
>>
>>So when in 2010 everyone can buy his own 64 bits machine,
>>then what i might do is i get a 64 bits machine and
>>add within 1 week 2 bitboards of 64 bits for pawns to DIEP!
>>
>>So i'll do the conversion at the time i have such a machine!
>>
>>To be a factor of 3 slower now on 99% of all computers in the world
>>(32 bits cpu's), that's not my favourite thing to do!
>
>
>I'm not a factor of 3 slower. That is your imagination working. Somethings
>are faster in bitmaps, some slower. You want to harp on the move generation.
>That is less than 10% of my execution time, so it doesn't count. Evaluation
>is very good in bitmaps.
>
>
>
>
>>
>>Right now i do manage with some 32 bits equivalents...
>>
>>The only 64 bits values i now have in diep are getting used for
>>node counts... :)
>>
>>>as very simple tests to answer questions like "here is a bitmap of my passed
>>>pawns, do I have an outside passer on one side, or do I have an outside passer
>>>on both sides of the board?" Ditto for "here is a bitmap of my candidate
>>>passers, ..." Right now, on the PC, those operations are pretty expensive
>>
>>As i mentioned, this can be done very fast with good 32 bits
>>alternatives which are in fact FASTER than 64 bits alternatives.
>>
>>That is, at 32 bits machines!
>>
>>>and dig into the advantage of bitmap evaluations. But on the 64 bit machines,
>>>those operations lose _all_ of their penalties, and begin to look pretty good.
>>
>>Diep suffered bigtime when i converted all my 8 bits stuff to 32 bits,
>>because it all occupied more space
>> - more code size
>> - more data size
>>
>>Diep became about 5% slower, and i need to mention that i didn't profit
>>optimally from 8 bits datastructures. The real penalty on P3 processors would
>>be around 10%. On K7 about 50%.
>>
>>So going from 32 bits to 64 bits, *everything* that becomes 64 bits which first
>>was 32 bits occupies more space. So i start losing like 10% to *start* with...
>>
>>Unless there are very GOOD reasons to get to 64 bits in the future i'll stick
>>to 32 bits for quite a long time with at least 99% of the code!
>>
>>As 99% of all applications will be 32 bits, for sure processors will
>>take care to be very fast with 32 bits code, so no problems there either!
>>
>>There are quite some problems converting windows applications to 64 bits
>>because 'int' in general is seen as a 32 bits thing!
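
A small sketch of why the width of 'int' matters for bitboard code. On the usual 64-bit data models 'int' stays 32 bits, so a shift written with a plain int constant is still a 32-bit shift. The function name is hypothetical. The fixed-width types come from the standard header stdint.h.

```c
#include <assert.h>
#include <stdint.h>

/* Return the bitboard with only square sq set.
   Writing 1 << sq instead would shift a 32-bit int, which is
   undefined behaviour for sq >= 32 even on a 64-bit machine. */
static uint64_t square_bit(int sq)
{
    assert(sq >= 0 && sq < 64);
    return UINT64_C(1) << sq;
}
```

Using uint64_t rather than 'long' also avoids depending on whether the platform makes 'long' 32 or 64 bits.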
>>
>>>I wouldn't suggest that anyone rewrite their working code. But I made the
>>>decision to do this several years ago. I haven't found it a disadvantage on
>>>32 bit machines, and I really doubt it will be a disadvantage on 64 bit
>>>machines.
>>
>>factor 2.5 at 32 bits machines for sure.
>>
>>>you have to think "data density". Which is what made our move generator and
>>>some evaluations on the Cray so very fast. But not until you start "thinking
>>>outside the box" and asking "how can this architecture help me do things more
>>>efficiently?" rather than "How can I make my program run on this architecture?"
>>
>>>Those are two vastly different questions. With vastly different answers.
>>
>>Here we all agree, but for me putting things in general in a bitboard
>>means i have 1 bit worth of information about something. That's too little
>>information for me!
>>
>>So if i ever go use 64 bits then it's for those parts of my program where
>>i can profit.
>>
>>What i still do not understand is why crafty is so slow on a 64 bits sun
>>machine. Everything is 64 bits there. For me a sun processor performs
>>the same as a PII processor at the same speed.
>>
>>How is crafty performing at it?
>>
>>IMHO the only reason why crafty isn't so slow on intel processors is
>>because you have way fewer branches than i have in DIEP.
>>
>>I am using complex patterns in DIEP's evaluation function, so everywhere
>>there are branches. In crafty the branches that are there are easier
>>to predict.
>>
>>So on 64 bits processors where the branch misprediction penalty isn't
>>big, crafty will perform very badly compared to DIEP.
>>
>>At an alpha the branch misprediction penalty is HUGE.
>>
>>So on a 21164 processor, which had only 8kb L1 cache, i did very badly with
>>DIEP. On a 4 processor 21264 machine DIEP is probably doing worse than on
>>a dual K7 palomino, which hopefully gets released next month,
>>operating at 1.5Ghz.
>>
>>Now the branch misprediction penalty of the K7 isn't very good. But
>>on the 21264 it's way worse. What the 21264 has to prevent mispredictions
>>a bit is a clever system of 2 killertables that reference each other.
>>
>>However the size of those tables is so small compared to the number of
>>branches in DIEP, that it won't speed me up a single %!
>>
>>A 633Mhz 21164 processor performed for DIEP like a 380Mhz PII.
>>
>>Now that was some time ago of course. Nowadays DIEP would do worse on
>>that 21164 when compared to the PII, because code+data size has become
>>bigger.
>>
>>A similar thing is what i have in mind for the 21264. Where on paper
>>it can do very well because it does 4 instructions a clock, it's very
>>likely that a K7 of the same Mhz is completely outgunning it for me.
>>
>>For crafty the 21264 is not as bad as it is for me, because you
>>suffer relatively less from the HUGE branch misprediction penalty
>>than i do with DIEP!
>>
>>It's not that the 64 bits datastructure is so good at 64 bits, the
>>problem is that the 21264 design has a serious problem with branches!
>>
>>Best regards,
>>Vincent