Computer Chess Club Archives


Subject: Re: SURPRISING RESULTS P4 Xeon dual 2.8Ghz

Author: Matt Taylor

Date: 10:14:56 12/17/02


On December 17, 2002 at 12:30:40, Vincent Diepeveen wrote:

>On December 17, 2002 at 11:59:48, Matt Taylor wrote:
>
>>On December 17, 2002 at 11:33:36, Vincent Diepeveen wrote:
>>
>>>On December 17, 2002 at 11:27:18, Matt Taylor wrote:
>>>
>>>>On December 17, 2002 at 10:10:46, Vincent Diepeveen wrote:
>>>>
>>>>>Hello,
>>>>>
>>>>>Some tests were performed in the USA, where some P4 Xeon dual 2.8Ghz
>>>>>systems get delivered now. In Europe we can't get them yet and
>>>>>most likely we don't want them either:
>>>>>
>>>>>Here are the results of DIEP at the Xeon 2.8Ghz dual ECC registered DDR ram.
>>>>>
>>>>>test 1: diep 4 processes. Of course HT enabled.
>>>>>   181538 nps
>>>>>
>>>>>test 2: diep 2 processes. HT enabled.
>>>>>   135924 nps
>>>>>
>>>>>test 3: diep 2 processes K7 1.6ghz (registered DDR ram all other settings
>>>>>        identical to xeon dual setup):
>>>>>   146555
>>>>>
>>>>>THE 2 TESTS NALIMOV DIDN'T OR COULDN'T WANT TO DO WITH CRAFTY
>>>>>SOME WEEKS AGO REVEAL A BIG WEAKNESS OF HT/SMT:
>>>>>
>>>>>test 4: diep 2 processes. HT disabled.    171288 nps
>>>>>
>>>>>test 5 and 6: diep single cpu HT disabled and enabled were same speed
>>>>>   92090  nps versus 92019 nps.
>>>>
>>>>Crafty gets better results with HT, but it's been optimized for HT. It just
>>>
>>>That hasn't been proven yet.
>>>
>>>there was no test done without HT and 2 processors as far as i know.
>>>
>>>Please read how i tested it.
>>
>>I'm pretty sure he did non-HT tests too.
>>
>>>>means you need a personal Intel engineer to make it blazing fast for people who
>>>>plopped down $600 USD for a top-of-the-line Intel chip. Before long they'll
>>>>start selling Intel engineers in local computer shops. Collect all 18...
>>>
>>>Crafty is doing 2 probes in 2 hashtables for example. Remove it and
>>>improve it to 4 probes at 1 table (which is faster on both intel and
>>>AMD anyway, but AMD profits more because its chipset is cheaper).
>>>
>>>>HT is a good idea, and it works in practice rather than just on paper. It just
>>>>doesn't work for -everything-.
>>>
>>>in the factory they press 2 cpu's and put a single P4 sticker on it.
>>>You pay a factor 2 more, but get something 11.4% faster. For databases
>>>it was measured 11% rather than 11.4%.
>>>
>>>That's what i call a bad buy!
>>
>>CPUs since the Pentium have been pipelined. The goal is to spread the work out
>>so you can get a throughput of at least 1 op/cycle. Not always possible,
>
>1 op/cycle is very bad.
>
>Already when the pentiumpro 200 existed the average C program ran at
>1.76 instructions a clock.
>
>So 1 iop/cycle is very bad then.
>
>Considering that a K7 when calculated back to the pentiumpro200 is hell
>of a lot faster for each processor clock, then you will realize clearly
>that it's not worse than 1.76 now.
>
>Note that the 1.76 number wasn't measured by me. I have no idea how i
>measure this for DIEP otherwise i would know.

The Pentium Pro executed an average of 1.76 micro-ops/cycle. There is a big
difference between an average speed and an instantaneous speed. If I travel at
10 km/h for 1 hour and then remain still for 9 hours, I had an
average speed of 1 km/h. I think the difference here is obvious enough that it
warrants no further discussion.

>I see however that processors like McKinley which can do more instructions
>a clock (6 a bundle) that it achieves way faster speeds for DIEP than
>any other processor. No, being 64 bits has nothing to do with that.

Itanium is designed for high IPC. They don't have to ramp the clock speed to
make it fast. It is silly to criticize the Pentium 4 for low IPC when it is
-designed- for it. It's like criticizing a Geo Metro for being slow. Most people
will scratch their heads for a second, look at you funny, and then say, "Duh?"

In the case of Pentium 4, it seems paradoxical to achieve greater speeds with
lower IPC, but it is the strategy Intel picked. It is useless to criticize
further without actual facts (such as the fact that Pentium 4 is intended to hit
5 GHz).

>In contradiction to move a 64 bits word at the McKinley requires more
>power than moving a 32 bits word at a K7.
>
>Diep is a 32 bits program so if i compile 'int' then if the compiler makes
>that a 64 bits instruction for the mckinley that's not my worry.
>
>My happy feeling is the speed of it and it is doing very well. it is 33%
>faster (a bad compiled executable with a cross compiler) for each Ghz of
>a K7. 1.33Ghz K7 == 1.0 Ghz McKinley.

That's not very good. The 1 GHz Itanium is the upper-end Itanium right now. Your
1.33 GHz Athlon was slower than a 2.4 GHz Pentium 4 (which you said was equal to
a 1.6 GHz Athlon), and an 800 MHz Itanium has performed as well as a 3 GHz
Pentium 4 for other people.

>Now that's *without* branch optimization yet for the McKinley. I don't know
>what speedup that will get for the mckinley but i compiled it for the
>itanium1 do not forget that. i didn't optimize for itanium2 at all.

Itanium 2 isn't out yet. McKinley is still Itanium 1. It's like the difference
between Thunderbird and AthlonXP.

>Nalimov probably has more details regarding this.
>
>It is clear to me at least that this itanium2 is a big winner.

I would hope so. Ever since I read about the IA-64 architecture it was obvious
to me that Intel engineers put tremendous thought into its design.

>>particularly with complex instructions. Every CPU since then had adhered to
>>superscalar designs.
>
>>The Pentium 4 is no different. It has an extremely long pipeline to enable it to
>>clock to higher frequencies.
>
>Exactly what i fear yes.

Why fear it? It's a natural progression. Branch prediction was -buggy- on a
Pentium, but it didn't matter because a mispredict didn't carry a huge penalty
with it. The Pentium can actually mispredict instructions that DON'T BRANCH.

Next was the P6 core: the PPro/Pentium 2/Pentium 3. It had a deeper pipeline, but
it also had reordering and other features, and it was still criticized for the
deep pipeline. Athlon was deeper than the K6. Now they release the Pentium 4 and
Intel gets criticized for a deep pipeline yet again. Why does it matter? They're
still fast in most code because most code rarely mispredicts its branches.

I fixed some of my branches simply by changing the direction they jumped. A lot
of it is as simple as that. If I mispredict once in a loop of 1,000 iterations
(each branching), what's the difference?
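
To put a rough number on that question, here is a back-of-the-envelope sketch. The 20-cycle penalty and the 10 cycles of work per iteration are assumed, illustrative values, not measurements of any particular CPU, and `mispredict_overhead` is a hypothetical helper name.

```python
def mispredict_overhead(iterations, mispredicts, penalty_cycles, cycles_per_iteration):
    """Return the extra cycles spent on mispredicts as a fraction of the useful work."""
    base_cycles = iterations * cycles_per_iteration   # cycles if every branch is predicted
    extra_cycles = mispredicts * penalty_cycles       # cycles lost to pipeline flushes
    return extra_cycles / base_cycles

# Assumed numbers: 20-cycle flush penalty, 10 cycles of work per loop iteration.
well_predicted = mispredict_overhead(1000, 1, penalty_cycles=20, cycles_per_iteration=10)
coin_flip = mispredict_overhead(1000, 500, penalty_cycles=20, cycles_per_iteration=10)

print(f"1 mispredict in 1000:   {well_predicted:.2%} overhead")
print(f"500 mispredicts in 1000: {coin_flip:.2%} overhead")
```

Under those assumptions a single mispredict per 1,000 iterations is lost in the noise, while a branch that behaves like a coin flip doubles the loop's cost. The pipeline depth only hurts in the second case.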

>> The bulk of this pipeline is shared for each
>>"logical" CPU. They share caches, execution units, decoders, etc. The only thing
>>that gets duplicated is the register set, a smaller part of the CPU.
>>
>>>>>First conclusion is that the system is profiting only from HT when you
>>>>>use 4 processes at the same time, OTHERWISE IT IS A DISADVANTAGE IF
>>>>>YOU MULTITHREAD, because see the big difference between 2 processes
>>>>>running with HT turned on and off.
>>>>>
>>>>>In itself when you have a program with just 2 threads which you
>>>>>run on a dual it gets slower. My assumption is that the hardware reports
>>>>>4 cpu's and that the software doesn't care at what cpu to schedule
>>>>>the processes/threads. the result of that is that there is a 33% chance
>>>>>that things get scheduled at a cpu which is already running a thread/process.
>>>>>
>>>>>Resulting in a system where 1 cpu idles kind of shortly and 1 cpu is running
>>>>>2 threads/processes.
>>>>>
>>>>>Actually the actual chance that the 2 processes are scheduled at
>>>>>2 different processors (there is 4 processors for the OS
>>>>>times 3 processors left for the second process is 12 different
>>>>>schedulings) is: 8/12 = 2/3 = 66%. In short there is a disaster possibility
>>>>>of 33%.
>>>>
>>>>Yes, when one thread is scheduled on one processor, there are 3 choices for the
>>>>other thread, and one is disaster. 1/3 = 33%.
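
The 1/3 figure can be checked by brute force. This is only a sketch under stated assumptions: two physical packages with two logical CPUs each, and a scheduler that places the two threads on distinct logical CPUs uniformly at random. A real OS scheduler is not random, and the `LOGICAL_TO_PHYSICAL` numbering is hypothetical.

```python
from itertools import permutations

# Assumed topology: logical CPUs 0 and 1 share package 0, logical CPUs 2 and 3 share package 1.
LOGICAL_TO_PHYSICAL = {0: 0, 1: 0, 2: 1, 3: 1}

def same_package_count(mapping):
    """Count ordered placements of 2 threads on distinct logical CPUs that share a package."""
    placements = list(permutations(mapping, 2))  # 4 * 3 = 12 ordered pairs
    collisions = [p for p in placements if mapping[p[0]] == mapping[p[1]]]
    return len(collisions), len(placements)

collisions, total = same_package_count(LOGICAL_TO_PHYSICAL)
print(f"{collisions}/{total} placements share a physical CPU = {collisions / total:.1%}")
```

That gives 4 bad placements out of 12, which matches the 8/12 good placements counted above.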
>>>
>>>>>Now the absolute speed from performance viewpoint. If the system idles
>>>>>completely and then starts to run *exclusively* diep at 4 processors, then
>>>>>the measured speedup as you can calculate is in the order of 11.4% for
>>>>>SMT/HT.
>>>>>
>>>>>That's not so much actually. The loss by searching parallel is at most
>>>>>parallel applications bigger than the win of 11.4%. In case of DIEP
>>>>>i am on the lucky side and go for that 11.4% faster speed.
>>>>>
>>>>>Yet the sad confirmation is that the pessimistic expectation about the
>>>>>absolute speed is completely confirmed. This system performs (assuming
>>>>>linear scaling) like a 1.98 Ghz dual K7.
>>>>
>>>>If memory is a big issue for Diep, it probably won't scale linearly as memory
>>>>never does.
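
For reference, the ratios behind these comparisons can be recomputed from the nps figures posted at the top of the thread. This is a sketch only: the dictionary keys are made-up labels, and the K7-equivalent clock assumes linear scaling with clock speed, which is exactly the assumption in question.

```python
# nps figures copied from the test results quoted earlier in this thread.
NPS = {
    "xeon_4proc_ht_on": 181538,
    "xeon_2proc_ht_on": 135924,
    "xeon_2proc_ht_off": 171288,
    "k7_2proc": 146555,
}
K7_CLOCK_GHZ = 1.6  # clock of the dual K7 in test 3

def ratio(a, b):
    """Speed of configuration a relative to configuration b."""
    return NPS[a] / NPS[b]

# Dual K7 clock that would match the 4-process Xeon result, assuming linear scaling.
k7_equivalent_ghz = K7_CLOCK_GHZ * ratio("xeon_4proc_ht_on", "k7_2proc")
# Slowdown of 2 processes with HT on relative to 2 processes with HT off.
ht_on_penalty = 1 - ratio("xeon_2proc_ht_on", "xeon_2proc_ht_off")
# Gain of 4 processes with HT on over 2 processes with HT off.
ht_gain_vs_2proc_off = ratio("xeon_4proc_ht_on", "xeon_2proc_ht_off") - 1

print(f"K7 equivalent clock: {k7_equivalent_ghz:.2f} GHz")
print(f"2 processes, HT on vs off: {ht_on_penalty:.1%} slower")
print(f"4 processes HT on vs 2 processes HT off: {ht_gain_vs_2proc_off:.1%} faster")
```

The first ratio reproduces the 1.98 GHz figure. The last one uses test 4 as its baseline and comes out lower than the 11.4% quoted above, so that number presumably rests on a different comparison than the one shown here.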
>>>
>>>It's a bigger issue for crafty than for DIEP. I hope you realize that
>>>this diep version is from 25 august 2002, that beta version runs pretty ok
>>>at cc-NUMA machines as well.
>>>
>>>Crafty doesn't though.
>>>
>>>>>there are motherboards now which do not require registered memory and
>>>>>the K7 runs already quite a while at 2.0Ghz in fact. Now i don't care
>>>>>for XP at all here nor do i care for the P4 at all. I just care for
>>>>>parallel search here.
>>>>>
>>>>>If we know that a 2.0Ghz dual K7 is identical to a dual 2.8Ghz Xeon
>>>>>and that in the majority of cases the K7 is going to win, then considering
>>>>>the huge price difference, the choice would be trivial for most who
>>>>>are looking for a lot of computing power for little money.
>>>>
>>>>AMD has always been better price/performance. Before the huge price differences
>>>>in AMD and Intel chips, the AMD chips meant your old Socket 7 board could be
>>>>used through ~500 MHz.
>>>
>>>>>Doesn't take away the fact that the P4 is winning ground. I remember
>>>>>the first dual AMD 1.2ghz test versus P4 dual 1.7Ghz and the AMD dual
>>>>>being 20% faster. Meaning in short that the speed of a P4 was performing
>>>>>about 1 : 1.7
>>>>>
>>>>>Now if i compare a dual Xeon 2.8Ghz with a 2Ghz K7 then it's equal
>>>>>meaning the P4 is performing 1 : 1.4
>>>>>
>>>>>So that's a big step forward!
>>>>
>>>>Well just about every application saw a similar gain going from the 256 KB
>>>>cache Willamette to the 512 KB cache Northwood. The new Xeons, as I understand,
>>>>have 1 MB L3 cache in -addition- to the other caches. Don't quote me there. All
>>>>I know is that things changed. The extra cache makes the P4 competitive whereas
>>>
>>>It's the DDR ram that speeded DIEP and crafty up a lot. Not the bigger
>>>cache so much.
>>>
>>>DDR ram has nearly 2 times faster latency than RDRAM.
>>
>>You seem so sure, but you never tested a Northwood on RDRAM or a Willamette on
>>DDR SDRAM to know.
>
>I tested many P4s with RDRAM and they were very slow.
>
>It is trivial that if the latency is 2 times slower that this has
>a major impact onto things like hashtables (assuming the same
>processor gets put in the machine, because obviously a cpu matters
>way more than a bit faster ram).
>
>It's like creating an obstacle.
>
>there is eval hashtable. there is transposition hashtable. there is
>pawn hashtables. etcetera in diep.
>
>that doesn't fit *ever* in L2 cache at all.
>
>A big L2 cache matters basically when you start getting parallel.
>
>For a single cpu speed the L2 cache matters nothing at all.
>
>The initial tests that i did rdram versus ddr ram indicated 1.7 versus 1.5
>
>with SMT that's 1.4 now.
>
>Best regards,
>Vincent

L2 matters a lot, just not in your hash probes. A hash probe is subject to
latency. Not everything is.
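
To illustrate why the probe layout matters for latency, here is a minimal sketch of the "4 probes in 1 table" idea mentioned earlier in the thread. It is not Crafty's or DIEP's code. `BucketedTable`, the entry layout, and the replace-shallowest policy are all assumptions, and Python only models the indexing: in C the four slots of a bucket would be laid out to share a cache line, so one memory access pays for all four probes.

```python
BUCKET_SIZE = 4  # four adjacent slots, standing in for entries that share one cache line

class BucketedTable:
    """Toy transposition table: one table, each key maps to a bucket of adjacent slots."""

    def __init__(self, n_buckets):
        self.n_buckets = n_buckets
        self.slots = [None] * (n_buckets * BUCKET_SIZE)  # each slot holds (key, depth, value)

    def _base(self, key):
        # Index of the first slot of the bucket this key maps to.
        return (key % self.n_buckets) * BUCKET_SIZE

    def probe(self, key):
        base = self._base(key)
        for i in range(base, base + BUCKET_SIZE):  # all probes stay inside one bucket
            entry = self.slots[i]
            if entry is not None and entry[0] == key:
                return entry
        return None

    def store(self, key, depth, value):
        base = self._base(key)
        victim = base
        for i in range(base, base + BUCKET_SIZE):
            entry = self.slots[i]
            if entry is None or entry[0] == key:
                victim = i                         # free slot or same position: reuse it
                break
            if entry[1] < self.slots[victim][1]:
                victim = i                         # otherwise evict the shallowest entry
        self.slots[victim] = (key, depth, value)

table = BucketedTable(1024)
table.store(0x1234, depth=5, value=42)
print(table.probe(0x1234))
```

Two probes in two separate tables touch two unrelated memory locations, so they can cost two full memory latencies. Four probes inside one bucket cost roughly one.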

It is probably very tempting to make the jump from "HT performance is x% in
chess engines" to "HT performance is x% in applications." You can't say that.
Chess engines are not representative of all applications. Neither your memory
accesses nor access patterns are representative of typical software.

I still haven't seen conclusive data, either. You need to run controlled tests.
You can't just change a bunch of things and say, "Oh, it was that one that
caused this effect."

-Matt



Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.