Computer Chess Club Archives


Subject: Re: [Q] What is Genius' speed?

Author: Robert Hyatt

Date: 17:03:28 08/11/98


On August 11, 1998 at 13:11:26, fca wrote:

>On August 11, 1998 at 11:23:15, Moritz Berger wrote:
>
>>>On August 11, 1998 at 08:06:36, Tom Kerrigan wrote:
>>>
>>>>Aside from software differences, the Pentium MMX/200 has a 66MHz L2 cache
>>>>(possibly smaller than 512k) whereas the Pentium II/300 has a 512k 150MHz L2
>>>>cache.
>>>
>>>Are you sure, Tom?  I thought the 1/2 speed clock for the L2 only applied to
>>>P2's of 350MHz or faster (excepting Xeons, which are of course full speed
>>>L2-ers).  If so, 100 MHz >> 66 MHz
>>
>>PII 2nd level cache frequency = core frequency / 2 for all PIIs (i.e. the
>>PII-233 runs its 2nd level cache at 116.5 MHz)
>
>Yup, I was wrong.  The bus switch from 66MHz to 100MHz is what coincided with
>the release of the 350MHz.  As I am holding out for a 450, I've deliberately
>been ignoring sysdoc this year to avoid temptation.  Too many little jumps waste
>time and DMs.
>
>I've familiarised myself with the Xeon, but conclude that by any sensible bang
>for the buck measure it is best left in the box.  For example, I have little
>doubt that on most/all chess programs, a P2/450 will better a Xeon/400 in an
>otherwise CPU-unstressed environment.
>
>>>>If a program really bangs on the L2 cache, it will go much faster on the
>>>>Pentium II.
>>>
>>>Surely. Let us test whether this is the cause of the differences Blass
>>>reports.
>>>
>>>(a) Does it follow in broad terms that the higher the NPS, the more hash
>>>activity (but what about other tables?), and therefore the more L2-dependence
>>>(L1 being deemed too small to have much influence)?
>>
>>Higher NPS -> higher memory throughput -> less cache efficiency (just imagine a
>>program that uses e.g. 64 KB hash and "lives" completely in the 512 KB 2nd level
>>cache of a PII in comparison with Fritz which fills up hundreds of megabytes of
>>hash tables in a couple of minutes).
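To make the two cases concrete, here is a toy transposition-table probe in C
(illustrative only, not taken from Fritz or any real engine).  The table is
indexed with part of the position's hash key, so successive probes land at
effectively random entries:

  /* With a 64 KB table, the whole array fits in a 512 KB L2 cache and
     probes always hit; with a table of tens of megabytes, the random
     index makes almost every probe a cache miss. */
  typedef struct {
      unsigned int  key;          /* verification part of hash key */
      short         score;
      unsigned char depth, flags;
  } ENTRY;

  static ENTRY        *table;     /* allocated at startup          */
  static unsigned int  mask;      /* entries (a power of two) - 1  */

  static ENTRY *probe(unsigned long hashkey)
  {
      return &table[hashkey & mask];   /* effectively random index */
  }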
>
>Put that way, seductive, Moritz. :-)
>
>But this case you conveniently quote is of course most highly and outrageously
>unrepresentative.  At important time controls (SSDF, 40/2 and similar) how many
>programs only use <= 512 KB of hash!  CST?
>
>So I cannot tell one way or the other...
>
>We are talking about programs that all consume significantly more than 512 KB
>of hash - some absurdly more (F5 seems like an engine to fill memory with some
>sort of data! ;-) )
>
>So *one* point is: will the asynchronous predictive processes that copy main
>RAM into L2 (I am way out of date here - I assume things have got smarter and
>smarter in the last 3-4 yrs) be time-expensive enough to make the access time
>of the L2 less important?
>
>Say we have a program CG (Chess Guru) filling 5 MB of hash and F5 filling
>50 MB of hash at 40/2, and each program is running on one computer with 66MHz
>L2 (say my 166) and also on one computer with 400MHz L2 (that Xeon you do not
>have ;-) ).  Both sets of hardware are deemed to have equally clever
>predictive (what do I need in the cache?) mechanisms; we also deem pipelining
>effects, peephole optimisations, 16/32 bit code effects, software techniques
>for accessing the hash, etc. all to have no bearing.
>
>So the QUESTION:
>
>Will the speed ratio Xeon/CG : 166/CG be greater or less than Xeon/F5 : 166/F5 ?
>
>I really think this is hard.
>
>>>(b) Is the P2/300 : P200MMX ratio even higher for F5 (which, I take it, is
>>>accepted to be significantly higher in NPS terms than J?)
>>
>>Junior 5 peaks out at slightly above 400kN/s on my PII400. So it's not
>>"significantly" slower in terms of NPS than Fritz 5.
>>
>>>If answer for (a) is Yes and answer for (b) is No, the cause is liable to be
>>>something else.
>>
>>Something else could be:
>>a) 16 bit code vs. 32 bit code
>
>Yup
>
>>b) optimization for P5 pipelines (superscalar design, executing multiple
>>instructions at once)
>
>Yup. While of course CG is hand-crafted (assembler), I do not know about J5,
>but like F5 I assume from its speed that it must have at least a hand-written
>core (correct?).
>
>A good trawl through compiled output shows lots of optimisations that could
>still be put into even current compilers.  Leaving aside pipelining
>efficiency, the compiler guys have yet to suppress excessive redundant stack
>activity.  Compiler writers should realise, with blazing hardware speeds, that
>it matters little to the author if he has to wait 10 secs or 1 min for the
>compile (cf. 10 mins : 1 hr a decade ago).  So have a quick compile mode with
>no clever optimisations, and alternatively a heavy-duty one with many
>cycles....  I bet 5%-10% could be gained here.  C authors of the world, wake
>up  ;-)
>
>>c) segment boundary alignment (affects P5 and P6 not in the same way)
>
>V. good point.
>
>>d) branch prediction logic of the CPU, assumptions about this in hand crafted
>>assembler code
>
>As for b)
>
>>e) use of certain "string" (i.e. affecting a sequence of bytes) instructions and
>>preferred primitive data types (32 bit integers, 16 bit integers, etc.) - P5 and
>>P6 speed optimizations are often mutually exclusive
>
>Yup!!
>
>>f) e.g. using the FPU to initialize hash tables - more than 2x faster on a P5
>>than using integer moves, about as fast on PII, much worse on K6.
>
>Progs could have a "what am I running on" bit at the beginning and do a quick
>& dirty patch-in of appropriate code for the time-critical bits - a sketch
>follows point g) below.
>
>>g) add your own favourite difference in processor architectures of P5 (Pentium,
>>Pentium MMX) and P6 (Pentium Pro, PII)
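A minimal sketch of that patch-in idea (every name below is invented for
illustration): detect the CPU family once at startup, then call the
time-critical routines through function pointers set to the best variant -
here, the FPU-based hash clear from point f) for the P5:

  #include <string.h>

  /* P5-friendly variant: on a Pentium, 64-bit FPU stores can clear
     memory roughly 2x faster than 32-bit integer moves (point f).
     A double 0.0 is an all-zero-bits pattern, so this really does
     zero the table. */
  static void clear_hash_fpu(void *p, unsigned long bytes)
  {
      double *d = (double *) p;
      unsigned long i, n = bytes / sizeof(double);
      for (i = 0; i < n; i++)
          d[i] = 0.0;
  }

  /* Generic variant for P6/K6, where the FPU trick buys nothing. */
  static void clear_hash_int(void *p, unsigned long bytes)
  {
      memset(p, 0, bytes);
  }

  /* The pointer the engine actually calls. */
  void (*clear_hash)(void *, unsigned long) = clear_hash_int;

  void init_dispatch(int cpu_family)   /* e.g. from CPUID: 5 = P5 */
  {
      if (cpu_family == 5)
          clear_hash = clear_hash_fpu;
  }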
>
>>Saludos
>>
>>Moritz
>>
>>P.S: Even different steppings (read: releases) of the Pentium MMX exhibit
>>massive differences in execution speed of certain commands (stepping 4 several
>>times slower (on some commands) than stepping 3 comes to my mind), so that's
>>another distracting factor we have to take into account.
>
>Highly distracting.  Let us reason out my deemed case first.. :-)
>
>Kind regards
>
>fca


As I mentioned in another thread, hash tables have *zero* impact on cache
performance.  Consider one hash probe per position, against at *least* 2,000
instructions per position, many of which also need one or more memory
operands - call it roughly 3,000 memory accesses per position in all.  So one
hash probe that misses cache every 3,000 memory accesses (most of which hit
cache) really has no effect on the program's performance at all...
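Back-of-envelope, with assumed rather than measured numbers - say a cached
access costs about one cycle and the one probe that goes all the way to main
memory costs about 50 extra cycles:

  /* Rough estimate of the slowdown from one missing hash probe per
     node, under the assumptions stated above. */
  #include <stdio.h>

  int main(void)
  {
      double hit_cycles = 3000.0 * 1.0;   /* ~3,000 cached accesses */
      double miss_extra = 50.0;           /* one probe misses       */
      printf("overhead = %.1f%%\n",
             100.0 * miss_extra / (hit_cycles + miss_extra));
      return 0;                           /* prints: overhead = 1.6% */
  }

Even doubling the assumed miss penalty only pushes that to about 3%.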

Ergo, when you are analyzing cache performance, such as between the Xeon and
the normal half-clock-cache Pentium IIs, ignore hash table sizes totally.
That is way less than 1% of what is going on, and removing the hash table
entirely won't make things better or worse...

Much more important is all the data that is pumped around more frequently,
such as the chess board, which will get referenced hundreds of times in a
single node.  When you think about that, you realize just how unimportant
hashing is.  In fact, early programs like Chess 4.x stuffed the hash table off
in *slow* memory (what was called ECS on the old CDC machines), put the
instructions and search data into fast memory, and it didn't affect their
speed at all.
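The contrast in access patterns looks something like this in a stripped-down
node (illustrative C only, not any engine's actual code):

  /* The board is small and touched constantly, so it stays resident
     in L1/L2; the hash table is huge and touched once per node at a
     random spot, so that single access misses - and matters little. */
  static int    board[64];        /* hot data: reused every node   */
  static short *hash_score;       /* cold data: tens of megabytes  */
  static unsigned int mask;       /* number of entries (2^n) - 1   */

  static int node(unsigned long key)
  {
      int i, sum = 0;
      short h = hash_score[key & mask];   /* the one likely miss */
      for (i = 0; i < 64; i++)            /* stand-in for move   */
          sum += board[i];                /* generation, eval... */
      return sum + h;
  }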

A real neat thing would be a hardware platform where you can say "don't
cache this part of memory at all", because caching the hash table is totally
worthless unless you can cache it *all*.  And each random probe does
dislodge a line of real data from cache, which is bad.
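Later x86 processors and compilers did grow a partial version of that wish:
non-temporal prefetch hints, which fetch a line while polluting the caches as
little as possible.  Neither existed on the processors discussed here; a
sketch using GCC's builtin:

  /* __builtin_prefetch with locality hint 0 compiles to PREFETCHNTA
     on SSE-capable x86: fetch the probed entry, but mark it as having
     no temporal locality so it displaces as little real data as
     possible. */
  static void prefetch_hash_entry(const void *entry)
  {
      __builtin_prefetch(entry, 0 /* read */, 0 /* no locality */);
  }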


