Computer Chess Club Archives


Subject: Re: SURPRISING RESULTS P4 Xeon dual 2.8Ghz

Author: Vincent Diepeveen

Date: 09:30:40 12/17/02


On December 17, 2002 at 11:59:48, Matt Taylor wrote:

>On December 17, 2002 at 11:33:36, Vincent Diepeveen wrote:
>
>>On December 17, 2002 at 11:27:18, Matt Taylor wrote:
>>
>>>On December 17, 2002 at 10:10:46, Vincent Diepeveen wrote:
>>>
>>>>Hello,
>>>>
>>>>Some tests were performed in the USA, where some P4 Xeon dual 2.8Ghz
>>>>systems get delivered now. In Europe we can't get them yet and
>>>>most likely we don't want them either:
>>>>
>>>>Here are the results of DIEP at the Xeon 2.8Ghz dual ECC registered DDR ram.
>>>>
>>>>test 1: diep 4 processes. Of course HT enabled.
>>>>   181538 nps
>>>>
>>>>test 2: diep 2 processes. HT enabled.
>>>>   135924 nps
>>>>
>>>>test 3: diep 2 processes K7 1.6ghz (registered DDR ram all other settings
>>>>        identical to xeon dual setup):
>>>>   146555
>>>>
>>>>THE 2 TESTS NALIMOV DIDN'T WANT TO OR COULDN'T DO WITH CRAFTY
>>>>SOME WEEKS AGO REVEAL A BIG WEAKNESS OF HT/SMT:
>>>>
>>>>test 4: diep 2 processes. HT disabled.    171288 nps
>>>>
>>>>test 5 and 6: diep single cpu HT disabled and enabled were same speed
>>>>   92090  nps versus 92019 nps.
>>>
>>>Crafty gets better results with HT, but it's been optimized for HT. It just
>>
>>That hasn't been proven yet.
>>
>>As far as i know, no test was done with 2 processors and HT disabled.
>>
>>Please read how i tested it.
>
>I'm pretty sure he did non-HT tests too.
>
>>>means you need a personal Intel engineer to make it blazing fast for people who
>>>plopped down $600 USD for a top-of-the-line Intel chip. Before long they'll
>>>start selling Intel engineers in local computer shops. Collect all 18...
>>
>>Crafty is doing 2 probes in 2 hashtables, for example. Remove that and
>>change it to 4 probes in 1 table (which is faster on both Intel and
>>AMD anyway, but AMD profits more because its chipset is cheaper).
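
A minimal sketch of the "4 probes in 1 table" idea, for illustration only. This is not code from DIEP or Crafty; the names `BucketTable`, `Entry` and `BUCKET_SIZE`, and the replace-the-shallowest-entry policy, are assumptions made for this example. The point is that all 4 candidate slots for a position sit next to each other in a single table, so one lookup touches one contiguous region of memory instead of two unrelated ones.

```python
from dataclasses import dataclass
from typing import List, Optional

BUCKET_SIZE = 4  # hypothetical: 4 adjacent slots are probed per lookup


@dataclass
class Entry:
    key: int    # full hash key of the position
    depth: int  # search depth the score was computed at
    score: int


class BucketTable:
    """One hashtable, probed at 4 consecutive slots per key (illustrative)."""

    def __init__(self, n_buckets: int) -> None:
        self.n_buckets = n_buckets
        self.slots: List[Optional[Entry]] = [None] * (n_buckets * BUCKET_SIZE)

    def _base(self, key: int) -> int:
        # Index of the first slot of the bucket this key maps to.
        return (key % self.n_buckets) * BUCKET_SIZE

    def probe(self, key: int) -> Optional[Entry]:
        base = self._base(key)
        for i in range(base, base + BUCKET_SIZE):
            entry = self.slots[i]
            if entry is not None and entry.key == key:
                return entry
        return None

    def store(self, key: int, depth: int, score: int) -> None:
        base = self._base(key)
        victim = base
        for i in range(base, base + BUCKET_SIZE):
            entry = self.slots[i]
            if entry is None or entry.key == key:
                victim = i  # free slot, or same position: reuse it
                break
            if entry.depth < self.slots[victim].depth:
                victim = i  # otherwise remember the shallowest entry
        self.slots[victim] = Entry(key, depth, score)
```

In a real engine the 4 slots would be sized and aligned so that a bucket fits in one cache line; Python cannot show that, so the sketch only shows the indexing and replacement logic.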
>>
>>>HT is a good idea, and it works in practice rather than just on paper. It just
>>>doesn't work for -everything-.
>>
>>in the factory they press 2 cpu's and put a single P4 sticker on it.
>>You pay a factor 2 more, but get something 11.4% faster. For databases
>>it was measured 11% rather than 11.4%.
>>
>>That's what i call a bad buy!
>
>CPUs since the Pentium have been pipelined. The goal is to spread the work out
>so you can get a throughput of at least 1 op/cycle. Not always possible,

1 op/cycle is very bad.

Already back when the Pentium Pro 200 existed, the average C program ran at
1.76 instructions per clock.

So 1 op/cycle would be very bad by that standard.

Considering that a K7 is a hell of a lot faster per processor clock than
the Pentium Pro 200, it should be clear that it is not doing worse
than 1.76 now.

Note that the 1.76 number wasn't measured by me. I have no idea how to
measure this for DIEP, otherwise i would know the figure.

I do see, however, that a processor like McKinley, which can issue more
instructions per clock (up to 6, as two bundles of 3), achieves way faster
speeds for DIEP than any other processor. No, being 64 bits has nothing to
do with that.

On the contrary: moving a 64-bit word on the McKinley requires more
power than moving a 32-bit word on a K7.

Diep is a 32-bit program, so if i write 'int' and the compiler turns
that into a 64-bit instruction for the McKinley, that's not my worry.

What makes me happy is its speed, and it is doing very well. Per GHz it is
33% faster than a K7 (and that with a badly compiled executable from a cross
compiler): 1.33 GHz K7 == 1.0 GHz McKinley.

Now that's *without* branch optimization for the McKinley yet. I don't know
what speedup that will give on the McKinley, but do not forget that i compiled
it for the Itanium 1. I didn't optimize for Itanium 2 at all.

Nalimov probably has more details regarding this.

It is clear to me at least that this itanium2 is a big winner.

>particularly with complex instructions. Every CPU since then had adhered to
>superscalar designs.

>The Pentium 4 is no different. It has an extremely long pipeline to enable it to
>clock to higher frequencies.

Exactly what i fear yes.

> The bulk of this pipeline is shared for each
>"logical" CPU. They share caches, execution units, decoders, etc. The only thing
>that gets duplicated is the register set, a smaller part of the CPU.
>
>>>>First conclusion is that the system only profits from HT when you
>>>>use 4 processes at the same time. OTHERWISE IT IS A DISADVANTAGE IF
>>>>YOU MULTITHREAD: see the big difference between 2 processes
>>>>running with HT turned on and off.
>>>>
>>>>A program with just 2 threads that you run on a dual with HT
>>>>enabled simply gets slower. My assumption is that the hardware reports
>>>>4 cpu's and that the software doesn't care on which cpu it schedules
>>>>the processes/threads. The result is a 33% chance that
>>>>something gets scheduled on a cpu which is already running a thread/process.
>>>>
>>>>That results in a system where 1 cpu more or less idles and 1 cpu is running
>>>>2 threads/processes.
>>>>
>>>>The actual chance that the 2 processes get scheduled on
>>>>2 different physical processors (the OS sees 4 processors for the first
>>>>process times 3 processors left for the second, so 12 different
>>>>schedulings) is: 8/12 = 2/3 = 66%. In short, there is a disaster possibility
>>>>of 33%.
>>>
>>>Yes, when one thread is scheduled on one processor, there are 3 choices for the
>>>other thread, and one is disaster. 1/3 = 33%.
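
The 1/3 figure can be checked by brute force. The sketch below assumes what the post assumes: the OS sees 4 logical CPUs, picks among them with no awareness of which ones share a physical package, and every placement is equally likely. The numbering of logical CPUs in `PHYSICAL_OF` is hypothetical.

```python
from itertools import permutations

# Assumption: logical CPUs 0 and 1 share physical package 0,
# logical CPUs 2 and 3 share physical package 1.
PHYSICAL_OF = {0: 0, 1: 0, 2: 1, 3: 1}

# Every ordered way to put process A and process B on two distinct logical CPUs.
placements = list(permutations(PHYSICAL_OF, 2))

# "Disaster" placements: both processes end up on the same physical package.
collisions = [(a, b) for a, b in placements if PHYSICAL_OF[a] == PHYSICAL_OF[b]]

collision_chance = len(collisions) / len(placements)
print(len(placements), len(collisions), round(collision_chance, 3))  # 12 4 0.333
```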
>>
>>>>Now the absolute speed from performance viewpoint. If the system idles
>>>>completely and then starts to run *exclusively* diep at 4 processors, then
>>>>the measured speedup as you can calculate is in the order of 11.4% for
>>>>SMT/HT.
>>>>
>>>>That's not so much actually. In most parallel applications the loss
>>>>from searching in parallel is bigger than the win of 11.4%. In case of DIEP
>>>>i am on the lucky side and go for that 11.4% faster speed.
>>>>
>>>>Yet the sad result is that the pessimistic expectation about the
>>>>absolute speed is completely confirmed. This system performs (assuming
>>>>linear scaling) like a 1.98 Ghz dual K7.
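
The 1.98 Ghz figure follows from test 1 and test 3 in the results above, if nps is assumed to scale linearly with K7 clock speed. A small worked version (the function name is made up for this example):

```python
XEON_4PROC_HT_NPS = 181538  # test 1: dual Xeon 2.8 GHz, 4 processes, HT enabled
K7_2PROC_NPS = 146555       # test 3: dual K7 1.6 GHz, 2 processes
K7_CLOCK_GHZ = 1.6


def equivalent_k7_clock(target_nps: float, k7_nps: float, k7_clock_ghz: float) -> float:
    """K7 clock needed to reach target_nps, assuming nps scales linearly with clock."""
    return target_nps / k7_nps * k7_clock_ghz


equiv_ghz = equivalent_k7_clock(XEON_4PROC_HT_NPS, K7_2PROC_NPS, K7_CLOCK_GHZ)
print(round(equiv_ghz, 2))  # 1.98
```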
>>>
>>>If memory is a big issue for Diep, it probably won't scale linearly as memory
>>>never does.
>>
>>It's a bigger issue for crafty than for DIEP. I hope you realize that
>>this diep version is from 25 august 2002, that beta version runs pretty ok
>>at cc-NUMA machines as well.
>>
>>Crafty doesn't though.
>>
>>>>there are motherboards now which do not require registered memory and
>>>>the K7 runs already quite a while at 2.0Ghz in fact. Now i don't care
>>>>for XP at all here nor do i care for the P4 at all. I just care for
>>>>parallel search here.
>>>>
>>>>If we know that a 2.0Ghz dual K7 is identical to a dual 2.8Ghz Xeon
>>>>and that in the majority of cases the K7 is going to win, then considering
>>>>the huge price difference, the choice would be trivial for most who
>>>>are looking for a lot of computing power for little money.
>>>
>>>AMD has always been better price/performance. Before the huge price differences
>>>in AMD and Intel chips, the AMD chips meant your old Socket 7 board could be
>>>used through ~500 MHz.
>>
>>>>That doesn't take away the fact that the P4 is winning ground. I remember
>>>>the first dual AMD 1.2ghz test versus P4 dual 1.7Ghz and the AMD dual
>>>>being 20% faster. Meaning in short that the P4 needed about 1.7 times
>>>>the clock of the K7 for the same speed (1 : 1.7).
>>>>
>>>>Now if i compare a dual Xeon 2.8Ghz with a dual 2Ghz K7 then it's equal,
>>>>meaning the P4 is now at 1 : 1.4
>>>>
>>>>So that's a big step forward!
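
The two ratios can be reproduced from the clock speeds quoted above, again assuming performance scales linearly with clock. The function name and the `k7_speed_advantage` parameter are made up for this sketch.

```python
def p4_clock_per_k7_clock(p4_ghz: float, k7_ghz: float,
                          k7_speed_advantage: float = 1.0) -> float:
    """P4 GHz needed to match 1 GHz of K7, assuming linear scaling with clock.

    k7_speed_advantage is how much faster the K7 system ran in the
    comparison (1.2 means 20% faster, 1.0 means equal speed).
    """
    return p4_ghz / k7_ghz * k7_speed_advantage


# First comparison: dual K7 1.2 GHz was 20% faster than dual P4 1.7 GHz.
early_ratio = p4_clock_per_k7_clock(1.7, 1.2, k7_speed_advantage=1.2)

# Current comparison: dual Xeon 2.8 GHz roughly equals a dual K7 2.0 GHz.
current_ratio = p4_clock_per_k7_clock(2.8, 2.0)

print(round(early_ratio, 2), round(current_ratio, 2))  # 1.7 1.4
```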
>>>
>>>Well, just about every application saw a similar gain going from the 256 KB
>>>cache Willamette to the 512 KB cache Northwood. The new Xeons, as I understand,
>>>have 1 MB L3 cache in -addition- to the other caches. Don't quote me there. All
>>>I know is that things changed. The extra cache makes the P4 competitive whereas
>>
>>It's the DDR ram that sped DIEP and crafty up a lot. Not so much the
>>bigger cache.
>>
>>DDR ram has nearly 2 times lower latency than RDRAM.
>
>You seem so sure, but you never tested a Northwood on RDRAM or a Willamette on
>DDR SDRAM to know.

I tested many P4s with RDRAM and they were very slow.

It is obvious that if the latency is 2 times slower, this has
a major impact on things like hashtables (assuming the same
processor gets put in the machine, because obviously the cpu matters
way more than slightly faster ram).

It's like creating an obstacle.

There is an eval hashtable, a transposition hashtable,
pawn hashtables, etcetera in diep.

That doesn't *ever* fit in L2 cache.

A big L2 cache basically matters when you start going parallel.

For single cpu speed the L2 cache matters nothing at all.

The initial tests that i did, RDRAM versus DDR ram, indicated 1.7 versus 1.5.

With SMT that's 1.4 now.

Best regards,
Vincent

>>>before P4 performance was something of an oxymoron, a joke among the people
>>>who'd seen its scores, and a disappointment for former Intel fans.
>>
>>>You'll probably observe the trend shift (not -completely-) toward the former
>>>when AMD releases Barton, likewise equipped with 512 KB of L2 cache.
>>
>>512KB is better than 256KB but i do not believe that the changing of just
>>the cache is going to improve the thing a lot. Getting it to 0.13 and
>>also clocking it at 3 Ghz will have more of an impact i bet.
>
>The size of the core doesn't affect performance directly. It affects how high
>the CPU gets clocked. The CPU can only do real work on the edges of a clock
>cycle. It doesn't matter how small it gets; if the CPU receives a 1 Hz clock,
>it's going to go 1 Hz, and that's pretty slow.
>
>In computationally-intensive applications, clock speed will yield linear
>increases in performance. However, you haven't posted results for Diep on a wide
>variety; you've only posted the four benchmarks. Little can be discerned except
>which system is faster. That yields no useful information about the architecture
>or how clock rate affects performance or how ram affects performance. There is
>no data.
>
>>>>Whether the step is because of DDR ram versus the very bad performing
>>>>RDRAM (nearly 2 times slower latency) is a matter of open discussion.
>>>>
>>>>HT/SMT in itself is not so impressive now.
>>>>
>>>>It's trivial to say that it will get impressive when the P4 can split itself
>>>>into 2 real processors having little dependencies on each other.
>>>>
>>>>Right now the single cpu win from HT on a P4 3.06Ghz (18%) is
>>>>clearly more than on the older generation 2.8 Ghz HT/SMT, so it seems
>>>>this technique is also slowly becoming realistic.
>>>>
>>>>Right now i can't take what's coming on the market very seriously.
>>>>
>>>>Best regards,
>>>>Vincent


