Computer Chess Club Archives



Subject: Re: BitScan with reset - not so impressive with 3DNow!

Author: Matt Taylor

Date: 09:03:07 12/04/02



On December 04, 2002 at 10:39:34, Sune Fischer wrote:

>On December 04, 2002 at 10:29:50, Matt Taylor wrote:
>
>>On December 03, 2002 at 07:00:37, Sune Fischer wrote:
>>
>>>On December 03, 2002 at 06:40:36, Gian-Carlo Pascutto wrote:
>>>
>>>>On December 02, 2002 at 19:43:15, Gerd Isenberg wrote:
>>>>
>>>>>Congrats, Walter!!
>>>>>
>>>>>10-bit pattern		bsf	PI2FD	btr	c      LSB_64
>>>>>0x0000000011111133	15.3	18.0	19.1	22.8    17.8
>>>>>0x1010111010101110	19.7	18.5	19.6	23.4    17.8
>>>>>0x1111113300000000	20.6	18.0	19.1	22.8    17.8
>>>>
>>>>Does this mean that this is the fastest currently known
>>>>FindFirstAndRemove *and* that it doesn't need assembly?
>>>>
>>>>Impressive for sure!
>>>>
>>>>Any similar tricks for lightning-speed popcounting?
>>>>
>>>>What's the fastest sequence you've got for popcounting so
>>>>far? (preferably one that doesn't depend too much on new
>>>>instructions)
>>>>
>>>>--
>>>>GCP
>>>
>>>Yes, very impressive.
>>>But how does one make sure the table remains in the cache?
>>>
>>>I think maybe in a real-world scenario things perform a little differently
>>>than in a small loop; I suspect perft might be a more realistic test for this
>>>one.
>>>
>>>-S.
>>
>>If you access any part of the table, the CPU will store that part of the table
>>into its cache. The exact size depends on the size of a data cache line. I think
>>most CPUs use the LRU algorithm for cache discards, meaning the table won't
>>leave the cache until you access a large amount of data without accessing the
>>table again. The exact amount depends on the size of the caches, plural.
>>
>>In P4, the L1 cache contains data that is already in the L2 cache, meaning that
>>the value is the size of the L2 cache, 512 KB for Northwood or 256 KB for
>>Willamette. I am not sure whether the Athlon is built the same way, but I don't
>>think it is. (Otherwise it wouldn't make any sense to have a humongous 64 KB L1
>>data cache.) Assuming I'm right, you have 320 KB on existing Athlon chips.
>>
>>Now, to directly answer the question of real-world performance, that depends
>>only on what a chess engine does in between calls to the function. If the
>>function is usually called in quick succession, performance will be near the
>>synthetic test because the table and all of the bit scanning code remain in the
>>L1 caches. If the function is called sparsely, it will suffer.
>>
>>Unless I'm completely wrong, the first case is more accurate for a chess engine,
>>isn't it?
>
>And that is the question! :)
>
>If you use it only for move _generation_ then you might have makemove and eval
>plus search stuff in between. Besides, for everything you store in L1 you have
>to lose something else, which might hurt just as much.
>
>IMO, table lookups always perform well when used extensively in small tight
>loops, so I don't think it is the right way to test it.
>
>Let me give you an example:
>Popcount is really fast if you test it like that using a [256] table, but it's
>even faster if you use a [65536] size, so which one would you pick for a real
>world program? :)
>
>-S.
>
>>-Matt

You are right, but his table is small (154 bytes). Two Athlon cache lines are
128 bytes (83.1% of the table). At worst, it spans 3 cache lines (192 bytes =
0.06% of the total effective data cache). On a P4, I believe the cache line size
is also 64 bytes. (On 6th gen. Intel cores -- P Pro, P2, P3 -- it is 32 bytes.)
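For reference, a table-driven bit scan with reset of the general flavor under
discussion can be sketched like this. This uses the well-known De Bruijn
multiplication trick (the constant is the published Leiserson/Prokop/Randall
one); it is an illustration of the technique, not Walter's actual 154-byte
routine or table layout:

```c
#include <stdint.h>

/* Index table for the De Bruijn multiplication trick -- a sketch of the
 * general table-lookup technique, not Walter's exact routine. */
static int lsbIndex[64];
static const uint64_t DEBRUIJN64 = 0x07EDD5E59A4E28C2ULL;

static void init_lsb_table(void)
{
    /* For a De Bruijn sequence, the top 6 bits of (DEBRUIJN64 << i)
     * are distinct for every i, so this mapping is a bijection. */
    for (int i = 0; i < 64; i++)
        lsbIndex[(DEBRUIJN64 << i) >> 58] = i;
}

/* Find-first-and-remove: returns the index of the lowest set bit of *bb
 * and clears that bit. Precondition: *bb != 0. */
static int lsb_with_reset(uint64_t *bb)
{
    /* bb & -bb isolates the lowest set bit (a power of two); multiplying
     * by the De Bruijn constant moves a unique 6-bit index to the top. */
    int idx = lsbIndex[((*bb & (0 - *bb)) * DEBRUIJN64) >> 58];
    *bb &= *bb - 1;  /* reset the bit just found */
    return idx;
}
```

The 64-entry int table is 256 bytes, i.e. four Athlon cache lines, so the same
caching argument applies to it.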

If you think it would be detrimental to performance, I could compare the speed
of the algorithms assuming the cache is flushed between iterations of the outer
loop. (Flushing the cache between iterations of the inner loop -- the one that
locates and resets all the bits -- would not be an accurate model of real-world
use, I think.)

Even if you do other work in between the pop counts and bit searches, the point
is that, overall, execution is punctuated with short bursts of heavy table
access. As long as that holds, the table will be cached efficiently.

As to whether or not the table hurts overall performance by kicking out other
needed data in the cache, the answer lies in how much data the chess engine
needs. If it needs a lot of data, then I would answer that the table-based
solution is universally good because the chess engine will -already- be throwing
things out of the cache on a continual basis.
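As an aside, the [256]-table popcount Sune mentions can be sketched as follows
(a generic illustration of the technique; the [65536] variant indexes 16 bits
at a time, trading four lookups instead of eight for a 64 KB table):

```c
#include <stdint.h>

/* popTable[i] = number of set bits in the byte i. */
static uint8_t popTable[256];

static void init_pop_table(void)
{
    /* popTable[0] is 0 (static zero-init); each entry adds the low bit
     * of i to the count for i with that bit shifted out. */
    for (int i = 1; i < 256; i++)
        popTable[i] = (uint8_t)((i & 1) + popTable[i / 2]);
}

/* Count set bits in a 64-bit word, one byte-table lookup per byte. */
static int popcount64(uint64_t bb)
{
    int count = 0;
    for (int i = 0; i < 8; i++) {
        count += popTable[bb & 0xFF];
        bb >>= 8;
    }
    return count;
}
```

The 256-byte table is only four Athlon cache lines, so it behaves like the bit
scan table under the caching argument above; the 64 KB version is an entire
Athlon L1 data cache, which is exactly the trade-off Sune is pointing at.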





Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.