Computer Chess Club Archives



Subject: Re: copy cost

Author: Robert Hyatt

Date: 08:32:10 08/25/03



On August 25, 2003 at 02:56:40, Johan de Koning wrote:

>On August 24, 2003 at 21:50:28, Robert Hyatt wrote:
>
>>On August 23, 2003 at 03:45:09, Johan de Koning wrote:
>>
>>>On August 22, 2003 at 10:45:11, Robert Hyatt wrote:
>>>
>>>>On August 22, 2003 at 02:53:06, Johan de Koning wrote:
>>>>
>>>>>On August 21, 2003 at 11:29:49, Robert Hyatt wrote:
>>>>>
>>>>>>On August 21, 2003 at 03:16:35, Johan de Koning wrote:
>>>
>>>[snip]
>>>
>>>>>>>Hence I dare to ask: 25% of what?
>>>>>>
>>>>>>NPS went _up_ by 25%+.  So total engine speed.
>>>>>>
>>>>>>This was changed in Crafty version 9.16, which dates back many years.
>>>>>
>>>>>Whoah! This is *very* hard to believe.
>>>>>There must have been something severely wrong with 9.15 then (continuing cache
>>>>>thrashing comes to mind, but that's just guessing). More likely, this number does
>>>>>not come from a clean comparison of copy/make versus make/unmake.
>>>>
>>>>The _only_ change made was to replace copy/make with make/unmake.  Think about
>>>>the math.
>>>
>>>Thinking about the math is easy. Doing the math in order to get valid results is
>>>much harder since it requires facts to start with.
>>>
>>>>  Copy/make copies 256 bytes+.  Once _every_ node.  On today's
>>>>hardware, my dual Xeon 2.8, I search about 2.4M nodes per second, or about
>>>>400 ns per node.  Copying 256 bytes is certainly going to show up on the radar
>>>>in a significant way, when it gets done once every 400 ns.
>>>
>>>To start simple, at 2.4 MN/s the average node takes 833 ns, or 2333 cycles.
>>>That's a fact. :-)
>>
>>OK..  Fact #1.  Your calculator is _broken_.  :)
>>
>>Enter 1 / 2400000 and hit the = button.
>>
>>You will get 417 nanoseconds.  You are off by a factor of two, somehow.
>
>Hey, I was only *playing* Vincent when I entered this thread! :-)
>Since I'm not really Vincent, there is no need to automatically state the
>opposite of what I write.

:)


>
>>>The next interesting fact is the time it takes to copy from cache to cache.
>>>Unfortunately, I don't know this fact, so doing the math stops here (while
>>>thinking about it continues :-).
>>>
>>>I just conducted a simple experiment on an Athlon Thunderbird 1333 MHz with my
>>>engine doing about 250 kN/s. Adding an unused copy (440 bytes) to the usual
>>>copy/make shows up as approx 3% in the sampling profiler (that is the single
>>>instruction repe movsd). Doing the math revealed that this 3% means about 180
>>>cycles per copy of 110 ints. Since I've heard that Athlon reads take 3 cycles,
>>>and I've heard long ago that K6 allowed 2 reads and 2 writes at the same time,
>>>it does make sense.
>>>
>>>So I'll venture to say it is *almost* a fact that AMDs do blockmoves at a rate
>>>of 64 bit in 3 cycles. I'm still pretty factless though about blockmove speed of
>>>P{I...IV}, not to mention the next generations.
>>
>>The issue is what did you copy?  IE the same X bytes, or something like
>>this:
>>
>>struct blk array[64]
>>
>>and copy from array[i] to array[i+1] where i == ply???
>>
>>That has two effects.  More real memory traffic, and more cache line
>>displacements.
>
>Since I was only interested in cache to cache speed, I simply inserted
>/**/ dummy[d] = pos[d];
>just before the usual copy/make that starts with
>/**/ pos[d+1] = pos[d];
>

There is a danger in your test.  It is _possible_ that a good compiler
will recognize that code as "useless" and toss it out silently.  I see this
happening all the time with gcc and the Intel compiler I use under Linux.

If you do something like a = b, and a is not global and is not used anywhere
further down in the procedure, the assignment will get eliminated.  If a is
global, the compiler has to do the copy, of course, since it can't tell who
might use that value in another procedure.
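
For example, a minimal sketch (the 110-int size is just taken from your test
above; the names are made up, not from either program):

  typedef struct { int b[110]; } POSITION;   /* ~440 bytes, like your test */

  POSITION pos[64];
  POSITION dummy[64];        /* global (external linkage), so the compiler
                                can't prove the extra store is dead */

  void copy_make(int d) {
      dummy[d] = pos[d];     /* the "unused" extra copy being timed */
      pos[d + 1] = pos[d];   /* the usual copy/make */
      /* ... apply the move to pos[d + 1] ... */
  }

If dummy were local to copy_make (or static and never read anywhere in the
file), the compiler is free to throw that first assignment away, and you'd
be timing nothing.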


>Compared to your example of a parallel array (or inflated struct position)
>the C2C speed should be the same. Displacements look cheaper in my test, but I
>think they're the same on average (not tested, hence possible oversights on my
>end). Actually I think displacements are always cheap in a sensible chess program
>on 21st-century machinery because most of the data is never dirty. TransWhatever
>tables are the obvious counter example, but they are (should be) typically
>restricted to small data and once per node.
>
>>>>  Yes, when you do
>>>>a copy, it will start off cache-to-cache.  But once you do that cache-to-cache
>>>>copy you are committed to eventually doing a cache-to-memory write-back.
>>>
>>>This is the final interesting fact.
>>>But also the most non-fact, since it depends on almost everything.
>>>
>>>In my little experiment above, the extra copy took only 3%, but the actual run
>>>time went up 5.5%. This may not mean much because 1 extra line in main() can
>>>easily change the runtime by 1 or 2% (for reasons I haven't fathomed yet). It
>>>may also mean that data cache is actually getting trashed, and I'm lucky not to
>>>use large tables on a regular basis.
>>>
>>>... Johan
>>
>>I'm probably more dependent on cache, with large bitmap tables that get indexed
>>all the time for the proper masks...  IE move generation, etc...
>
>I think you are, that's why I called it a non-fact.
>But maybe you should get rid of all those tables. :-)
>
>... Johan


Maybe you should add 'em.

:)
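
BTW, for reference, the two approaches being compared, in sketch form (the
struct and the move handling here are placeholders, not Crafty's actual data
structures):

  typedef struct { int board[64]; int side, castle, ep; } POSITION;

  POSITION pos[65];                     /* one entry per ply */

  /* copy/make: duplicate the whole position at every node, then modify the
     copy; "unmaking" is free, you just drop back to pos[ply] */
  void copy_make(int ply, int from, int to) {
      pos[ply + 1] = pos[ply];          /* the copy that costs bytes every node */
      pos[ply + 1].board[to] = pos[ply + 1].board[from];
      pos[ply + 1].board[from] = 0;
  }

  /* make/unmake: modify one position in place, saving just enough to undo it */
  typedef struct { int from, to, captured; } UNDO;

  void make(POSITION *p, UNDO *u, int from, int to) {
      u->from = from;  u->to = to;  u->captured = p->board[to];
      p->board[to] = p->board[from];
      p->board[from] = 0;
  }

  void unmake(POSITION *p, UNDO *u) {
      p->board[u->from] = p->board[u->to];
      p->board[u->to] = u->captured;
  }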



