Author: Robert Hyatt
Date: 15:11:28 08/25/03
On August 25, 2003 at 18:00:49, Sune Fischer wrote:

>On August 25, 2003 at 17:35:09, Robert Hyatt wrote:
>
>>On August 25, 2003 at 17:22:52, Sune Fischer wrote:
>>
>>>On August 25, 2003 at 16:50:08, Dan Andersson wrote:
>>>
>>>> The issue is the same. Because you can't guarantee that copying will be in
>>>>cache. And you can't guarantee that other data structures won't be close or
>>>>aligned in such a way that it won't trash the cache. The impact might not be
>>>>great but it will be there. So the net cache bandwidth will be lower or even
>>>>much lower than the simple linear relationship. Thus the slow main memory
>>>>bottleneck will appear.
>>>
>>>For the whole picture goes, probably yes, but it is not easy to figure that
>>>since there are many factors.
>>>
>>>I know this is down to hair splitting now, but IMO the reason that unmaking is
>>>faster than uncopying isn't the one Bob gave, and I quote:
>>>
>>>"
>>>>>I was thinking more about how silly it is to copy the empty bitboards for each
>>>>>ply. If you update the boards that are active, they will stay in the cache.
>>>>>Those that are not used might drop out, unless they are copied once every micro
>>>>>second.
>>>>>
>>>>
>>>>That is a reasonable rate for a program that searches 1M nodes per second. I'm
>>>>going at 2.4M so make that about once every 400 nanoseconds. :) Suddenly it
>>>>begins to add up in a big way. :)"
>>>
>>>As though the 2.4 Mnps was the reason.
>>
>>No. the 2.4M nps simply gives a frequency, roughly 400ns. Which is _my_
>>programs frequency on my dual 2.8ghz box. That gives me a _specific_ time
>>per node, and it is pretty easy to estimate that copy/make is going to be
>>a significant part of that...
>
>I'm not fond of this way of thinking, you don't actually do one node per 400 ns,
>you do two nodes per 800 ns.

That is exactly the same thing. That's the point.

>
>For one thing you method gets confusing if you want to e.g. count clocks per
>node.

Why? At 2.4M nodes per second it takes 2x the clock tick rate that it takes to
run at 1.2M nodes per second, regardless of whether it is a dual doing two ticks
per real tick, a super-scalar box doing more than one instruction per tick, or a
dumb box doing one instruction per tick.

>
>>I was not saying that 2.4M nodes per second is the reason it fails for me,
>>particularly. I simply said that I search a node per 400+ ns, which means
>>I have to do a copy/make every 400+ ns. That's a lot of bandwidth. That the
>>PC doesn't really have.
>
>But you have a "double" PC when talking cache bandwith...
>

So? If you double the clock rate of that single CPU, you _also_ have double the
cache bandwidth... They are all increasing at about the same speed. SRAM
fortunately doesn't have the huge latency of DRAM and nobody uses DRAM for
cache.

>>The dual actually makes this worse than a single cpu, as I said, due to two
>>caches, snooping writes, and invalidating things in their own cache that the
>>other processor just modified in the other cache.
>
>That is true, it is all _very_ confusing.. :)
>
>>It was about crafty and copy/make. As I said, if I run at 2.4M nodes per
>>second, I have to do a copy/make every 400 ns. Whether I have one processor
>>or 1024 processors, that won't change.
>
>Well if you have 1024 processors each processor only has to do 1/1024 of the
>work on the numbers you post, so I'd say it does chance something.

Yes, but it doesn't change the amount of time spent doing a copy operation every
node. Each processor takes 1024 times longer to do the copy, but with 1024 of
them I do 1024 at a time. Net gain or loss is zero.

>
>> And, in fact, on the dual it is harder
>>to do that than on a single because of snooping.
>
>yes, I do understand that :)
>
>>>
>>>He gave numbers of 25%, nobody can confirm those numbers (I get ~10%), but I
>>>figure now that he was talking 25% in SMP search, or what?
>>
>>No. Crafty version 9 was not SMP. The first SMP version was 15.0. The
>>copy/make was dropped in version 9, and it produced a 25% speedup, no more,
>>no less. For Crafty specifically. That's all I can say with any certainty.
>>And I don't claim it is 25% for _other_ programs. Only that it was 25% for
>>Crafty, and that was the _only_ change to the program. Going from Copy/Make to
>>Make/Unmake.
>
>If this is true, I think you must have been trashing cache badly.

I still probably do, in fact. On the PIV it is easier to do since the line size
is now 4x longer. You only get 4096 cache lines in a PIV with 512KB. That is 4K
different "chunks" of stuff. A hash probe makes one chunk. A second hash probe
makes two chunks. You cycle through those 4K lines very quickly. Every copy/make
takes two for my bitmap stuff. Not to mention all the attack table lookups
(rotated bitmap lookups) and other stuff I'm doing... 4K is pretty small, when
you think about it. And once it isn't big enough, you are looking directly at
RAM bandwidth/latency, which is very bad from the processor's perspective.

>
>>I don't think there are _any_ differences between SMP and non-SMP in this
>>regard, other than possibly SMP is worse, rather than being better as you
>>suggested (dual caches, etc).
>
>Yes it didn't quite ended up sounding like I meant it :)
>
>-S.
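
For anyone who wants to check the arithmetic in the reply above, here is a rough sketch in Python. It is only a back-of-envelope calculation, not anything taken from Crafty. The function names are made up for illustration, and the 128-byte line size is an assumption inferred from "4096 cache lines in a PIV with 512KB". The traffic figure assumes each copy/make touches exactly two such lines, as the post says for the bitmap data, and ignores everything else a node does.

```python
# Back-of-envelope numbers for the copy/make discussion.
# All names here are hypothetical helpers, not Crafty code.

def ns_per_node(nodes_per_second):
    """Average time between nodes for the whole machine, in nanoseconds."""
    return 1e9 / nodes_per_second

def per_cpu_interval_ns(nodes_per_second, cpus):
    """Time each processor gets per node when the total rate is shared.

    With N processors each one has N times longer per node, but N nodes
    are in flight at once, so the machine-wide copy rate is unchanged.
    """
    return cpus * ns_per_node(nodes_per_second)

def cache_lines(cache_bytes, line_bytes):
    """Number of distinct lines a cache of this size can hold."""
    return cache_bytes // line_bytes

def copy_bytes_per_second(lines_per_copy, line_bytes, nodes_per_second):
    """Bytes written per second if every node copies lines_per_copy lines."""
    return lines_per_copy * line_bytes * nodes_per_second

if __name__ == "__main__":
    LINE_BYTES = 128          # assumption: inferred from 512KB / 4096 lines
    NPS = 2_400_000           # total nodes per second quoted in the post

    print("ns per node:", ns_per_node(NPS))
    print("ns per node per cpu on a dual:", per_cpu_interval_ns(NPS, 2))
    print("lines in a 512KB cache:", cache_lines(512 * 1024, LINE_BYTES))
    print("copy bytes/s:", copy_bytes_per_second(2, LINE_BYTES, NPS))
```

The dual case comes out at twice the single-machine interval, which is the "two nodes per 800 ns" view; the machine-wide number of copies per second is the same either way.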
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.