Computer Chess Club Archives



Subject: Re: Sempron vs. Athlon 64: Proof that Crafty's working set is < 256k

Author: Robert Hyatt

Date: 07:46:13 08/21/04



On August 21, 2004 at 04:46:00, Sune Fischer wrote:

>On August 20, 2004 at 21:36:57, Robert Hyatt wrote:
>
>>On August 20, 2004 at 19:09:20, Sune Fischer wrote:
>>
>>>On August 20, 2004 at 13:03:28, Robert Hyatt wrote:
>>>
>>>>On August 20, 2004 at 12:05:59, Tom Kerrigan wrote:
>>>>
>>>>>On August 20, 2004 at 05:10:10, Tony Werten wrote:
>>>>>
>>>>>>On August 20, 2004 at 04:33:07, Tom Kerrigan wrote:
>>>>>>
>>>>>>>Now that AMD is selling two processors that are identical other than L2 cache
>>>>>>>size (Sempron has 256k, Athlon 64 has 512k) we have proof of Crafty's working
>>>>>>>set size:
>>>>>>
>>>>>>Or that it's >> 512
>>>>>>
>>>>>>Basically, this only proves that it's not between 256 and 512.
>>>>>
>>>>>That's not how cache works. By that rationale, you might as well not have any
>>>>>cache at all. See my reply to Bob above.
>>>>>
>>>>>-Tom
>>>>
>>>>
>>>>Tom, that is _exactly_ how cache works for some reference patterns.
>>>
>>>I don't believe this pattern is an accurate model for a chess engine.
>>
>>Believe what you want.  I spent a fair amount of time making this so in Crafty
>>to improve cache utilization.  I.e., I _want_ to access memory serially to take
>>advantage of prefetching through cache line fills, and I reordered chunks of
>>data to make that happen.
>
>But you tend to go over the same nodes again and again, i.e. when at the leaf
>and you do a two ply search, you will have a lot of evaluations of almost the
>same position and your tree stack will be going back and forth on the same tiny
>dataset.
>
>>>Then let's design a way :)
>>>
>>>For instance, say Crafty has a working set of 200 kB, try and add a dummy table
>>>of 1 kB to your working set by looping through this at every node.
>>>
>>>Now run the test again with increasing table sizes until you get hit with a big
>>>slowdown. That's when you start seriously thrashing the cache.
>>>
>>>Of course this assumes there's no thrashing to begin with, but since Crafty is a
>>>pretty fast searcher I don't think there are any big cache issues.
>>
>>
>>"You don't think there are any big issues..."  "I _know_ there are".  I've
>>actually measured it, analyzed it, and worked on it both with folks at Microsoft
>>through Eugene, and through folks at AMD...
>
>There is thrashing and then there is thrashing.
>
>I think it matters if you do it at every node or only at every 10000th node.
>
>>move generation tables
>>extern signed char directions[64][64];
>>extern BITBOARD w_pawn_attacks[64];
>>extern BITBOARD b_pawn_attacks[64];
>>extern BITBOARD knight_attacks[64];
>>extern BITBOARD bishop_attacks[64];
>>extern BITBOARD bishop_attacks_rl45[64][64];
>>extern BITBOARD bishop_attacks_rr45[64][64];
>>extern BITBOARD rook_attacks_r0[64][64];
>>extern BITBOARD rook_attacks_rl90[64][64];
>>extern int bishop_mobility_rl45[64][64];
>>extern int bishop_mobility_rr45[64][64];
>>extern unsigned char bishop_shift_rl45[64];
>>extern unsigned char bishop_shift_rr45[64];
>>extern int rook_mobility_r0[64][64];
>>extern int rook_mobility_rl90[64][64];
>>extern BITBOARD rook_attacks[64];
>>extern POSITION display;
>>extern BITBOARD queen_attacks[64];
>>extern BITBOARD king_attacks[64];
>>extern BITBOARD king_attacks_1[64];
>>extern BITBOARD king_attacks_2[64];
>>extern BITBOARD obstructed[64][64];
>>extern BITBOARD w_pawn_random[64];
>>extern BITBOARD b_pawn_random[64];
>>extern BITBOARD w_knight_random[64];
>>extern BITBOARD b_knight_random[64];
>>extern BITBOARD w_bishop_random[64];
>>extern BITBOARD b_bishop_random[64];
>>extern BITBOARD w_rook_random[64];
>>extern BITBOARD b_rook_random[64];
>>extern BITBOARD w_queen_random[64];
>>extern BITBOARD b_queen_random[64];
>>extern BITBOARD w_king_random[64];
>>extern BITBOARD b_king_random[64];
>>extern BITBOARD stalemate_sqs[64];
>>extern BITBOARD edge_moves[64];
>>extern BITBOARD enpassant_random[65];
>>extern BITBOARD castle_random_w[2];
>>extern BITBOARD castle_random_b[2];
>>extern BITBOARD wtm_random[2];
>>extern BITBOARD clear_mask[65];
>>extern BITBOARD clear_mask_rl90[65];
>>extern BITBOARD clear_mask_rl45[65];
>>extern BITBOARD clear_mask_rr45[65];
>>extern BITBOARD set_mask[65];
>>extern BITBOARD set_mask_rl90[65];
>>extern BITBOARD set_mask_rl45[65];
>>extern BITBOARD set_mask_rr45[65];
>>
>>
>>Those are just a few random tables,
>
>They aren't random at all, that's my point.

That has been _my_ point.  I run through a _lot_ of such tables.  And that tends
to flush cache before anything gets reused.  A random probe into anything
replaces a full cache line, not just the bytes touched.  I.e., on my PIV, a
1-byte access replaces 128 bytes of cache in one chunk.  On my dual xeons, the
512K L2 holds 4096 such lines.  I.e., 4096 random accesses can completely flush
cache, and I only accessed 4096 _bytes_ to do that.




>
>Take for instance b_knight_random, I bet you can easily go 10000 nodes without
>having to access b_knight_random[0]. And when the knights are off you don't need
>them at all.


See above...



>
>> not counting the actual stuff in my TREE
>>structure that gets "swept over" many times in a search..
>
>I guess it doesn't matter much, as long as you don't do it at an ultra high
>frequency.
>Maybe there is a cache miss when going from ply 20 to ply 0, but the search
>doesn't do that very often, most of the time is spent going from ply 15 to 20 or
>some such.


Again, you have to understand cache and how lines get filled.  All I need to do
is touch 4K different cache lines, which is quite easy, and I have blown the old
contents out of cache.  I believe the opteron I used had 64-byte lines and 1M of
L2.  That turns into 16,384 cache lines.  256K bytes on an AMD-64 would be back
to 4K lines, since its lines are half as long as the PIV's.

512K may _sound_ big, but it isn't unless you really take care to maintain
locality of reference.  That is one of the recent changes I made for the
opteron, to limit the problems with the MOESI cache coherency stuff.  The idea
of a program's working set fitting into cache is not really sound if you are
talking about a 512K working set and 512K of cache: the chance of that
"perfect" mapping of working set onto unique cache lines is very low.





>
>>bitboard programs are not very small.  But the disadvantages of size are offset
>>in other ways, particularly as cache gets bigger and bigger.  This ignores
>>hashing and repetitions and everything else.
>
>You're ignoring the facts here; it seems that larger caches (today) have no
>influence on performance, even for Crafty.



Who is ignoring that fact?  I reported that when I first looked into buying my
first quad xeon, I benchmarked Crafty on the PII xeon with 512K, 1024K, and
2048K of L2 cache.  1024K was 10% faster; 2048K was another 7% faster.  Eugene
ran Crafty on IA64 with 1.5M L2 and 3.0M L2 and found 3.0M was 10% faster.

Who is ignoring what?

Tom ran on 256K and 512K and concluded the working set for Crafty is < 256K.  I
simply said the data doesn't support the conclusion.  The conclusion _may_ be
right.  But two of us ran on larger-L2 boxes and got better performance; one
did not.  Two of us ran up to 2048K; one only tried 256K and 512K.




>
>>testing says bigger hashes help.  At least for the testing I reported, and for
>>the test results Eugene reported.
>
>You ran a test but you did not isolate the variable first, so what you saw is
>not easy to interpret.


What "problem" did I not isolate?  I simply ran exactly the same program, on
exactly the same processors, but with three different cache sizes.  I measured
the difference in speed, and since the only difference was cache size, the
speed difference had to be attributed to that.



>
>I think at some point you begin to cache part of the hash table. Since you
>revisit the same nodes again and again in iterative deepening, if you could
>cache some of that (not just hash it but also cache it!) it might give you a
>measurable speedup.


I don't think cache hits on TTable entries are very likely.  Again, the cache
doesn't cache random bytes; it caches consecutive chunks of bytes.  That makes
_all_ the difference in the world when it comes to hashing.  That is why I
modified my hash code to put the double-probe entries at consecutive memory
addresses, to take advantage of the cache line fill that prefetches the data
around the first entry probed.




>
>With caches that large there is an effect from this I'm sure, how big it is I
>have no idea but it makes it hard to say what you're measuring.
>

Caches are not that large.  Again, 4K lines on my PIV xeon doesn't strike me as
"large".  It is pretty large for consecutive memory addressing, if the operating
system does its job laying out the program in physical memory to minimize
cache line aliasing, but I don't do a lot of sequential reading.  I access here
and there, all over the place...





>>I can't say more but I can say what I have actually _measured_.
>
>You saw what you saw, yes, but it's your interpretation of the experiment I
>question.



Give me a break.  There is but _one_ interpretation of the data I presented:
bigger cache measurably improves performance for Crafty.  What other
interpretation is possible from the data that either I or Eugene observed?




>
>-S.


