Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: BitScan with reset - not so impressive with 3DNow!

Author: Matt Taylor

Date: 07:29:50 12/04/02

Go up one level in this thread


On December 03, 2002 at 07:00:37, Sune Fischer wrote:

>On December 03, 2002 at 06:40:36, Gian-Carlo Pascutto wrote:
>
>>On December 02, 2002 at 19:43:15, Gerd Isenberg wrote:
>>
>>>Congrats, Walter!!
>>>
>>>10-bit pattern		bsf	PI2FD	btr	c      LSB_64
>>>0x0000000011111133	15.3	18.0	19.1	22.8    17.8
>>>0x1010111010101110	19.7	18.5	19.6	23.4    17.8
>>>0x1111113300000000	20.6	18.0	19.1	22.8    17.8
>>
>>Does this mean that this is the fastest currently known
>>FindFirstAndRemove *and* that it doesn't need assembly?
>>
>>Impressive for sure!
>>
>>Any similar tricks for lightning-speed popcounting?
>>
>>What's the fastest sequence you've got for popcounting so
>>far? (preferably one that doesn't depend too much on new
>>instructions)
>>
>>--
>>GCP
>
>Yes very impressive.
>But how do one make sure the table remains in the cache?
>
>I think maybe in a real world senario things perform a little different than in
>a small loop, I suspect perft might be a little more realistic test for this
>one.
>
>-S.

If you access any part of the table, the CPU will store that part of the table
into its cache. The exact size depends on the size of a data cache line. I think
most CPUs use the LRU algorithm for cache discards, meaning the table won't
leave the cache until you access a large amount of data without accessing the
table again. The exact amount depends on the size of the caches, plural.

In P4, the L1 cache contains data that is already in the L2 cache, meaning that
the value is the size of the L2 cache, 512 KB for Northwood or 256 KB for
Williamette. I am not sure whether the Athlon is built the same way, but I don't
think it is. (Otherwise it wouldn't make any sense to have a humongous 64 KB L1
data cache.) Assuming I'm right, you have 320 KB on existing Athlon chips.

Now, to directly answer the question of real-world performance, that depends
only on what a chess engine does in between calls to the function. If the
function is usually called in quick succession, performance will be near the
synthetic test because the table and all of the bit scanning code remain in the
L1 caches. If the function is called sparsely, it will suffer.

Unless I'm completely wrong, the first case is more accurate for a chess engine,
isn't it?

-Matt



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.