Author: Roberto Waldteufel
Date: 18:53:20 10/31/98
Go up one level in this thread
On October 31, 1998 at 20:57:09, Robert Hyatt wrote:
>On October 31, 1998 at 19:45:28, Roberto Waldteufel wrote:
>
>>
>>On October 31, 1998 at 10:44:25, Robert Hyatt wrote:
>>
>>>On October 31, 1998 at 07:27:43, Bo Persson wrote:
>>>
>>>>On October 29, 1998 at 13:20:45, Roberto Waldteufel wrote:
>>>>[snip]
>>>>>Of course, once I started using assembler on the intel CPU, the bsf and bsr
>>>>>instructions were a god-send and I stopped all this nonsense with modular
>>>>>arithmetic....:-)
>>>>>
>>>>>Best wishes,
>>>>>Roberto
>>>>
>>>>I thought these instructions were microcoded and extremely slow on anything
>>>>newer than a 386! When testing on a Pentium, a table lookup was actually much
>>>>faster for me.
>>>>
>>>>
>>>>Bo Persson
>>>>bop@malmo.mail.telia.com
>>>
>>>
>>>ditto here. But I only tested this on a P5 (non-mmx) when I started doing this
>>>stuff. the instructions were grossly slow. Someone even tried some hand-coded
>>>stuff to help and it was still slower than what I do at present (table lookup).
>>
>>I don't know if these instructions are microcoded or not, but if you guys both
>>say so then I guess they probably are. However, I just ran some tests with a
>>rather surprising outcome. I have recently developed a rotated bitboard
>>implementation for diagonal moves with a view to upgrading my program from
>>ordinary bitboards. Having got it working, I was interested to see how fast it
>>was, so I got it to generate attacks for 10 seconds (sequentially doing all 64
>>squares in turn repeatedly) and calculated the number of maps generated per
>>second, averaged out over several runs. I used a PII 333MHz (with MMX), and I
>>got almost exactly 400,000 maps per second. Then I tested my old assembler coded
>>method (using lots of bsf and bsr instructions!) for generating the same thing.
>>Since it was easy to do at the same time, I tested also rook and queen map
>>generations using this assembler method. In each case I had to insert two extra
>>instructions (pushad and popad) on account of the loop within which the code
>>ran, so the real values would be very slightly faster than what I measured:
>>
>>Bishop attack maps: 580,000 maps per second
>>Rook attack maps: 560,000 maps per second
>>Queen attack maps: 510,000 maps per second
>>
>>So the assembler version with the bsf and bsr instructions outperformed the
>>rotated bitboard generator by over 25%. I have not tested rooks and queens with
>>rotated bitboards, but I suspect the rooks would be similar to the bishops, and
>>the queens would take double the time using rotated bitboards, whereas my method
>>shows only a slight degradation in speed going from bishops or rooks to the
>>omnidirectional queen. I conclude that either
>>
>>a) the assembler version is very efficient, or
>>
>>b) I screwed up somehow in my rotated bitboard implementation.
>>
>>I am quite ready to believe that (b) was the culprit, but even so, I think the
>>assembler routine is still pretty fast. Has anyone else tested this sort of
>>thing recently? I don't really know how much of a difference the PII technology
>>makes, but unless I can speed up the rotated bitboard method I will stick to the
>>assembler routine. Of course, whichever method is faster, I could still ditch
>>the incrementally updated attack maps and generate on the fly like Crafty does.
>>
>>If bsf and bsr really are microcoded, would I not get much lower counts? Or
>>maybe I ought to be getting even higher than I am? I don't know exactly how fast
>>is fast in this context.
>>
>>Best wishes,
>>Roberto
>
>
>note the architecture I said I tested on: the original P5 with no mmx, no
>fancy tricks or anything. The P6/PII has a totally different cpu core, and
>the bsf/bsr might be much faster in them. I have not tested this since the
>P6/PII came out... but I will now...
Hi Bob - I looked it up, and yes, the bsf and bsr instructions used to be very
slow, but on PII and PPro they each execute in 1 or two clocks. Before it was
something like 11xthe number of zeroes that had to be skipped, which would be
ghastly. This probably accounts for why my code runs so quickly. I was expecting
the rotated bitboard method to turn out faster, and I did several runs just to
convince myself that it was not some quirky fluke.
My source for the timing of bsf and bsr way the Pentium Optimization manual by
Agner Hog, which I downloaded today after my compiler manufacturer recommended
it in response to a query I made on their BBS. It's fascinating stuff if you
want to make the most of assembler on the Pentium, and goes into lots of detail
about cache, pipelineing, stalls, instruction pairing, MMX instructions and tons
of other stuff - I highly recommend it!
http://announce.com/agner
Best wishes,
Roberto
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.