Computer Chess Club Archives


Subject: Re: [Q] What is Genius' speed?

Author: fca

Date: 10:11:26 08/11/98


On August 11, 1998 at 11:23:15, Moritz Berger wrote:

>On August 11, 1998 at 08:45:55, fca wrote:
>
>>On August 11, 1998 at 08:06:36, Tom Kerrigan wrote:
>>
>>>Aside from software differences, the Pentium MMX/200 has a 66MHz L2 cache
>>>(possibly smaller than 512k) whereas the Pentium II/300 has a 512k 150MHz L2
>>>cache.
>>
>>Are you sure, Tom?  I thought the 1/2 speed clock for the L2 only applied to
>>P2's of 350MHz or faster (excepting Xeons, which are of course full speed
>>L2-ers).  If so, 100 MHz >> 66 MHz
>
>PII 2nd level cache frequency = core frequency / 2 for all PIIs (i.e.
>PII-233 2nd level cache at 116.5 MHz)

Yup, I was wrong.  The front-side bus switch from 66MHz to 100MHz is what
coincided with the release of the 350MHz PII.  As I am holding out for a 450,
I have deliberately been ignoring sysdoc this year to avoid temptation.  Too
many little upgrade jumps waste time and DMs.

I've familiarised myself with the Xeon, but conclude that by any sensible bang
for the buck measure it is best left in the box.  For example, I have little
doubt that on most/all chess programs, a P2/450 will better a Xeon/400 in an
otherwise CPU-unstressed environment.

>>>If a program really bangs on the L2 cache, it will go much faster on the
>>>Pentium II.
>>
>>Surely. Let us test that this is the cause for the differences blass reports.
>>
>>(a) Does it follow in broad terms that the higher the nps, the more hash
>>activity (but what about other tables?), therefore more L2-dependence (L1 deemed
>>to be too small to have too much influence)?
>
>Higher NPS -> higher memory throughput -> less cache efficiency (just imagine a
>program that uses e.g. 64 KB hash and "lives" completely in the 512 KB 2nd level
>cache of a PII in comparison with Fritz which fills up hundreds of megabytes of
>hash tables in a couple of minutes).

Put that way, seductive, Moritz. :-)

But this case you conveniently quote is of course outrageously
unrepresentative.  At important time controls (SSDF conditions, 40/2 and
similar), how many programs use only <= 512KB of hash!  CST?

So I cannot tell one way or the other...

We are talking about programs that all consume significantly more than 512KB
of hash - some absurdly more (F5 seems like an engine built to fill memory
with some sort of data! ;-) )
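
To be concrete about why the hash is the painful part, here is a minimal
sketch (entry layout and names invented purely for illustration): every probe
is an essentially random index into the table, so once the table is many
times the size of the 512KB L2, nearly every probe is an L2 miss.

  #include <stdlib.h>

  /* one transposition-table entry - sizes invented for illustration */
  typedef struct {
      unsigned long key;        /* Zobrist signature */
      short         score;
      unsigned char depth;
      unsigned char flags;
  } hash_entry;

  static hash_entry   *table;        /* e.g. 50MB worth of entries */
  static unsigned long num_entries;

  static int init_table(unsigned long bytes)
  {
      num_entries = bytes / sizeof(hash_entry);
      table       = malloc(num_entries * sizeof(hash_entry));
      return table != NULL;
  }

  /* successive probes land on unrelated cache lines */
  static hash_entry *probe(unsigned long zobrist_key)
  {
      return &table[zobrist_key % num_entries];
  }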

So *one* point is: will the asynchronous predictive processes that copy main
RAM into L2 (I am way out of date here - I assume things have got smarter and
smarter in the last 3-4 yrs) be expensive enough in time to make the access
speed of the L2 itself less important?

Say we have a program CG (Chess Guru) filling 5MB of hash and F5 filling 50MB
of hash at 40/2, and each program is run both on a computer with 66MHz L2 (say
my 166) and on a computer with 400MHz L2 (that Xeon you do not have ;-) ).
Both sets of hardware are deemed to have equally clever predictive (what do I
need in the cache?) mechanisms; we also deem pipelining effects, peephole
optimisations, 16/32-bit code effects etc., and the software techniques for
accessing the hash all to have no bearing.

So the QUESTION:

Will the speed ratio Xeon/CG : 166/CG be greater or less than Xeon/F5 : 166/F5 ?

I really think this is hard.
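
For what it is worth, here is the sort of back-of-envelope model one could
play with.  Every number below is invented, and it assumes the hash probe is
the only memory traffic (i.e. it ignores everything we just deemed away), so
it settles nothing - it only shows where the sensitivity lies, namely in the
two L2 hit rates:

  #include <stdio.h>

  /* per-node time = fixed compute cost + cost of one hash probe,
     which hits the L2 with some probability */
  static double node_ns(double compute, double hit, double l2_ns, double ram_ns)
  {
      return compute + hit * l2_ns + (1.0 - hit) * ram_ns;
  }

  int main(void)
  {
      double l2_fast = 10.0, l2_slow = 45.0;   /* ns: 400MHz vs 66MHz L2   */
      double ram     = 180.0, compute = 400.0; /* ns: deemed equal on both */

      /* CG's 5MB table hits the 512KB L2 a little more often than F5's 50MB */
      double cg = node_ns(compute, 0.10, l2_slow, ram) /
                  node_ns(compute, 0.10, l2_fast, ram);
      double f5 = node_ns(compute, 0.01, l2_slow, ram) /
                  node_ns(compute, 0.01, l2_fast, ram);

      printf("fast-L2 : slow-L2 speed ratio, CG %.4f  F5 %.4f\n", cg, f5);
      return 0;
  }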

>>(b) Is the P2/300 : P200MMX ratio even higher for F5 (which I take it is
>>accepted is significantly higher in nps terms than J?)
>
>Junior 5 peaks out at slightly above 400kN/s on my PII400. So it's not
>"significantly" slower in terms of NPS than Fritz 5.
>
>>If answer for (a) is Yes and answer for (b) is No, the cause is liable to be
>>something else.
>
>Something else could be:
>a) 16 bit code vs. 32 bit code

Yup

>b) optimization for P5 pipelines (superscalar design, executing multiple
>instructions at once)

Yup.  CG is of course hand-crafted (assembler); I do not know about J5, but
like F5 I assume from its speed that it must have at least a hand-written core
(correct?).

A good trawl through compiled output shows that plenty of optimisations could
still be added to even current compilers.  Leaving aside pipelining
efficiency, the compiler guys have yet to suppress excessive redundant stack
activity.  Compiler writers should realise that with blazing hardware speeds
it matters little to the author whether he waits 10 secs or 1 min for the
compile (cf. 10 mins vs. 1 hr a decade ago).  So have a quick compile mode
with no clever optimisations, and alternatively a heavy-duty one that burns
many cycles....  I bet 5%-10% could be gained here.  C authors of the world,
wake up  ;-)
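
With gcc, say, the split would already look something like the two lines
below; whether the heavy mode as it stands buys the full 5%-10% is exactly
what I doubt.

  gcc -O0 -o engine engine.c                          (quick mode, no clever optis)
  gcc -O2 -fomit-frame-pointer -o engine engine.c     (heavy-duty mode)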

>c) segment boundary alignment (affects P5 and P6 not in the same way)

V. good point.

>d) branch prediction logic of the CPU, assumptions about this in hand crafted
>assembler code

As for b)

>e) use of certain "string" (i.e. affecting a sequence of bytes) instructions and
>preferred primitive data types (32 bit integers, 16 bit integers, etc.) - P5 and
>P6 speed optimizations are often mutually exclusive

Yup!!

>f) e.g. using the FPU to initialize hash tables - more than 2x faster on a P5
>than using integer moves, about as fast on PII, much worse on K6.

Programs could have a "what am I running on?" check at the beginning and do a
quick & dirty patch-in of the appropriate code for the time-critical bits.
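
A minimal sketch of what I mean (the detection routine below is a placeholder
- a real one would execute CPUID and inspect family/model/stepping - and all
the names are made up):

  #include <string.h>

  #define HASH_BYTES (8UL * 1024 * 1024)
  static unsigned char hash[HASH_BYTES];

  /* two ways of wiping the hash: per point f), FPU stores can be over 2x
     faster on a P5, about equal on a PII, and worse on a K6 */
  static void clear_hash_integer(void) { memset(hash, 0, HASH_BYTES); }
  static void clear_hash_fpu(void)
  {
      /* stand-in: the real version would do 64-bit FPU stores in assembler */
      memset(hash, 0, HASH_BYTES);
  }

  enum cpu_kind { CPU_P5, CPU_P6, CPU_K6 };

  /* placeholder - the real program would detect this via CPUID */
  static enum cpu_kind detect_cpu(void) { return CPU_P6; }

  /* chosen once at start-up, then called from the time-critical spots */
  static void (*clear_hash)(void);

  int main(void)
  {
      clear_hash = (detect_cpu() == CPU_P5) ? clear_hash_fpu
                                            : clear_hash_integer;
      clear_hash();
      return 0;
  }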

>g) add your own favourite difference in processor architectures of P5 (Pentium,
>Pentium MMX) and P6 (Pentium Pro, PII)

>Saludos
>
>Moritz
>
>P.S: Even different steppings (read: releases) of the Pentium MMX exhibit
>massive differences in execution speed of certain commands (stepping 4 several
>times slower (on some commands) than stepping 3 comes to my mind), so that's
>another distracting factor we have to take into account.

Highly distracting.  Let us reason out my deemed case first... :-)

Kind regards

fca



