Computer Chess Club Archives

Subject: Re: SURPRISING RESULTS P4 Xeon dual 2.8Ghz

Author: Vincent Diepeveen
Date: 17:23:40 12/17/02
On December 17, 2002 at 13:14:56, Matt Taylor wrote:

I remember Bob comparing 1MB L2 cache Xeons with 2MB L2 cache Xeons a few
years ago and concluding it mattered 0% for him.

I won't claim it matters exactly 0% for DIEP to go from 1MB to 2MB,
but on the R14000 I see nearly no difference for DIEP between 8MB and 2MB
L2 cache either.

Best regards,
Vincent


>On December 17, 2002 at 12:30:40, Vincent Diepeveen wrote:
>
>>On December 17, 2002 at 11:59:48, Matt Taylor wrote:
>>
>>>On December 17, 2002 at 11:33:36, Vincent Diepeveen wrote:
>>>
>>>>On December 17, 2002 at 11:27:18, Matt Taylor wrote:
>>>>
>>>>>On December 17, 2002 at 10:10:46, Vincent Diepeveen wrote:
>>>>>
>>>>>>Hello,
>>>>>>
>>>>>>Some tests were performed in the USA, where some P4 Xeon dual 2.8Ghz
>>>>>>systems get delivered now. In Europe we can't get them yet and
>>>>>>most likely we don't want them either:
>>>>>>
>>>>>>Here are the results of DIEP at the Xeon 2.8Ghz dual ECC registered DDR ram.
>>>>>>
>>>>>>test 1: diep 4 processes. Of course HT enabled.
>>>>>>   181538 nps
>>>>>>
>>>>>>test 2: diep 2 processes. HT enabled.
>>>>>>   135924 nps
>>>>>>
>>>>>>test 3: diep 2 processes K7 1.6ghz (registered DDR ram all other settings
>>>>>>        identical to xeon dual setup):
>>>>>>   146555 nps
>>>>>>
>>>>>>THE 2 TESTS NALIMOV DIDN'T WANT TO OR COULDN'T DO WITH CRAFTY
>>>>>>SOME WEEKS AGO REVEAL A BIG WEAKNESS OF HT/SMT:
>>>>>>
>>>>>>test 4: diep 2 processes. HT disabled.
>>>>>>   171288 nps
>>>>>>
>>>>>>test 5 and 6: diep single cpu, HT disabled and enabled, were the same speed:
>>>>>>   92090 nps versus 92019 nps.
>>>>>
>>>>>Crafty gets better results with HT, but it's been optimized for HT. It just
>>>>
>>>>That hasn't been proven yet.
>>>>
>>>>there was no test done without HT and 2 processors as far as i know.
>>>>
>>>>Please read how i tested it.
>>>
>>>I'm pretty sure he did non-HT tests too.
>>>
>>>>>means you need a personal Intel engineer to make it blazing fast for people who
>>>>>plopped down $600 USD for a top-of-the-line Intel chip. Before long they'll
>>>>>start selling Intel engineers in local computer shops. Collect all 18...
>>>>
>>>>Crafty is doing 2 probes in 2 hashtables for example. Remove it and
>>>>improve it to 4 probes at 1 table (which is faster on both intel and
>>>>AMD anyway, but AMD profits more because its chipset is cheaper).
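The "4 probes at 1 table" idea can be sketched roughly as below. This is only an illustration, not DIEP's or Crafty's actual code: the class name `BucketedTable`, the `(key, depth, value)` entry layout, and the replace-the-shallowest policy are all assumptions made for the example. The point it shows is that the four candidate slots for a key sit next to each other in one table, so one probe touches one contiguous region of memory instead of two separate tables.

```python
BUCKET_SIZE = 4  # assumed: four adjacent slots examined per probe


class BucketedTable:
    """Hypothetical single hash table probed in buckets of four adjacent slots."""

    def __init__(self, n_buckets):
        self.n_buckets = n_buckets
        # Each slot holds None or a (key, depth, value) tuple.
        self.slots = [None] * (n_buckets * BUCKET_SIZE)

    def _base(self, key):
        # Index of the first slot of the bucket this key maps to.
        return (key % self.n_buckets) * BUCKET_SIZE

    def probe(self, key):
        """Return the stored entry for key, or None if it is not in its bucket."""
        base = self._base(key)
        for i in range(base, base + BUCKET_SIZE):
            entry = self.slots[i]
            if entry is not None and entry[0] == key:
                return entry
        return None

    def store(self, key, depth, value):
        """Store into an empty or matching slot, else overwrite the shallowest entry."""
        base = self._base(key)
        victim = base
        for i in range(base, base + BUCKET_SIZE):
            entry = self.slots[i]
            if entry is None or entry[0] == key:
                victim = i
                break
            if entry[1] < self.slots[victim][1]:
                victim = i
        self.slots[victim] = (key, depth, value)
```

In a real engine the entries would be packed so that a whole bucket fits in a cache line; Python lists do not give that layout, so the sketch only shows the indexing scheme.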
>>>>
>>>>>HT is a good idea, and it works in practice rather than just on paper. It just
>>>>>doesn't work for -everything-.
>>>>
>>>>in the factory they press 2 cpu's and put a single P4 sticker on it.
>>>>You pay a factor 2 more, but get something 11.4% faster. For databases
>>>>it was measured 11% rather than 11.4%.
>>>>
>>>>That's what i call a bad buy!
>>>
>>>CPUs since the Pentium have been pipelined. The goal is to spread the work out
>>>so you can get a throughput of at least 1 op/cycle. Not always possible,
>>
>>1 op/cycle is very bad.
>>
>>Already when the pentiumpro 200 existed the average C program ran at
>>1.76 instructions a clock.
>>
>>So 1 iop/cycle is very bad then.
>>
>>Considering that a K7 when calculated back to the pentiumpro200 is hell
>>of a lot faster for each processor clock, then you will realize clearly
>>that it's not worse than 1.76 now.
>>
>>Note that the 1.76 number wasn't measured by me. I have no idea how to
>>measure this for DIEP; otherwise I would know.
>
>The Pentium Pro executed an average of 1.76 micro-ops/cycle. There is a big
>difference between an average speed and an instantaneous speed. If I travel 10
>km/h for the duration of 1 hour and then remain still for 9 hours, I had an
>average speed of 1 km/h. I think the difference here is obvious enough that it
>warrants no further discussion.
>
>>I see however that processors like McKinley, which can do more instructions
>>a clock (6 a bundle), achieve way faster speeds for DIEP than
>>any other processor. No, being 64 bits has nothing to do with that.
>
>Itanium is designed for high IPC. They don't have to ramp the clock speed to
>make it fast. It is silly to criticize the Pentium 4 for low IPC when it is
>-designed- for it. It's like criticizing a Geo Metro for being slow. Most people
>will scratch their heads for a second, look at you funny, and then say, "Duh?"
>
>In the case of Pentium 4, it seems paradoxical to achieve greater speeds with
>lower IPC, but it is the strategy Intel picked. It is useless to criticize
>further without actual facts (such as the fact that Pentium 4 is intended to hit
>5 GHz).
>
>>On the contrary, moving a 64 bits word on the McKinley requires more
>>power than moving a 32 bits word on a K7.
>>
>>Diep is a 32 bits program so if i compile 'int' then if the compiler makes
>>that a 64 bits instruction for the mckinley that's not my worry.
>>
>>My happy feeling is the speed of it and it is doing very well. Per Ghz it is
>>33% faster than a K7 (with a badly compiled executable from a cross
>>compiler): 1.33Ghz K7 == 1.0 Ghz McKinley.
>
>That's not very good. The 1 GHz Itanium is the upper-end Itanium right now. Your
>1.33 GHz Athlon was slower than a 2.4 GHz Pentium 4 (which was equal to 1.6 GHz
>Athlon you said), and an 800 MHz Itanium has performed as well as a 3 GHz
>Pentium 4 for other people.
>
>>Now that's *without* branch optimization yet for the McKinley. I don't know
>>what speedup that will get for the mckinley but i compiled it for the
>>itanium1 do not forget that. i didn't optimize for itanium2 at all.
>
>Itanium 2 isn't out yet. McKinley is still Itanium 1. It's like the difference
>between Thunderbird and AthlonXP.
>
>>Nalimov probably has more details regarding this.
>>
>>It is clear to me at least that this itanium2 is a big winner.
>
>I would hope so. Ever since I read about the IA-64 architecture it was obvious
>to me that Intel engineers put tremendous thought into its design.
>
>>>particularly with complex instructions. Every CPU since then has adhered to
>>>superscalar designs.
>>
>>>The Pentium 4 is no different. It has an extremely long pipeline to enable it to
>>>clock to higher frequencies.
>>
>>Exactly what i fear yes.
>
>Why fear it? It's a natural progression. Branch prediction was -buggy- on a
>Pentium, but it didn't matter because a mispredict didn't carry a huge penalty
>with it. The Pentium can actually mispredict instructions that DON'T BRANCH.
>
>Next was the P6 core, the PPro/Pentium 2/Pentium 3. Deeper pipeline, but they
>had reordering and other stuff. Criticized for a deep pipeline. Athlon was
>deeper than the K6. Now they release the Pentium 4 and Intel gets scathed for a
>deep pipeline yet again. Why does it matter? They're still fast in most code
>because most code doesn't branch mispredict.
>
>I fixed some of my branches simply by changing the direction they jumped. A lot
>of it is as simple as that. If I mispredict once in a loop of 1,000 iterations
>(each branching), what's the difference?
>
>>> The bulk of this pipeline is shared for each
>>>"logical" CPU. They share caches, execution units, decoders, etc. The only thing
>>>that gets duplicated is the register set, a smaller part of the CPU.
>>>
>>>>>>First conclusion is that the system is profiting from HT only when you
>>>>>>use 4 processes at the same time, OTHERWISE IT IS A DISADVANTAGE IF
>>>>>>YOU MULTITHREAD, because see the big difference between 2 processes
>>>>>>running with HT turned on and off.
>>>>>>
>>>>>>In itself when you have a program with just 2 threads which you
>>>>>>run on a dual it gets slower. My assumption is that the hardware reports
>>>>>>4 cpu's and that the software doesn't care at what cpu to schedule
>>>>>>the processes/threads. the result of that is that there is a 33% chance
>>>>>>that things get scheduled at a cpu which is already running a thread/process.
>>>>>>
>>>>>>Resulting in a system where 1 cpu idles kind of shortly and 1 cpu is running
>>>>>>2 threads/processes.
>>>>>>
>>>>>>Actually the chance that the 2 processes are scheduled at
>>>>>>2 different processors (there are 4 processors for the OS
>>>>>>times 3 processors left for the second process, which is 12 different
>>>>>>schedulings) is: 8/12 = 2/3 = 66%. In short there is a disaster possibility
>>>>>>of 33%.
>>>>>
>>>>>Yes, when one thread is scheduled on one processor, there are 3 choices for the
>>>>>other thread, and one is disaster. 1/3 = 33%.
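The 12-schedulings count above can be checked by brute force. This is a minimal sketch under two assumptions that go beyond the post: the OS places each process uniformly at random on a distinct logical CPU, and logical CPUs 0/1 and 2/3 are the HT siblings of the two physical chips (the numbering is made up for the example).

```python
from fractions import Fraction
from itertools import permutations

# Assumed mapping: logical CPU -> physical CPU (0/1 are siblings, 2/3 are siblings).
PHYSICAL_OF = {0: 0, 1: 0, 2: 1, 3: 1}


def collision_probability(physical_of=PHYSICAL_OF):
    """Chance that two processes on distinct logical CPUs share one physical CPU."""
    # Ordered placements of 2 processes on 4 logical CPUs: 4 * 3 = 12.
    placements = list(permutations(physical_of, 2))
    shared = [p for p in placements if physical_of[p[0]] == physical_of[p[1]]]
    return Fraction(len(shared), len(placements))


print(collision_probability())      # 1/3
print(1 - collision_probability())  # 2/3
```

A scheduler that knows which logical CPUs are siblings would not place processes uniformly, so the 1/3 figure only holds under the naive assumption stated in the post.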
>>>>
>>>>>>Now the absolute speed from performance viewpoint. If the system idles
>>>>>>completely and then starts to run *exclusively* diep at 4 processors, then
>>>>>>the measured speedup as you can calculate is in the order of 11.4% for
>>>>>>SMT/HT.
>>>>>>
>>>>>>That's not so much actually. For most parallel applications the loss
>>>>>>from searching in parallel is bigger than the win of 11.4%. In case of DIEP
>>>>>>i am on the lucky side and go for that 11.4% faster speed.
>>>>>>
>>>>>>Yet the sad confirmation is that the pessimistic expectation about the
>>>>>>absolute speed is completely confirmed. This system performs (assuming
>>>>>>linear scaling) like a 1.98 Ghz dual K7.
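The 1.98 Ghz figure can be reproduced from the numbers in test 1 and test 3, assuming (as the post does) that nps scales linearly with K7 clock speed. The function and constant names below are made up for the example.

```python
# Figures quoted in the post, in nodes per second.
XEON_DUAL_HT_4PROC_NPS = 181538  # test 1: dual Xeon 2.8Ghz, HT on, 4 processes
K7_DUAL_2PROC_NPS = 146555       # test 3: dual K7 1.6Ghz, 2 processes
K7_CLOCK_GHZ = 1.6


def equivalent_k7_clock(xeon_nps, k7_nps, k7_clock_ghz):
    """K7 clock that would match xeon_nps, assuming nps is linear in clock."""
    return k7_clock_ghz * xeon_nps / k7_nps


ghz = equivalent_k7_clock(XEON_DUAL_HT_4PROC_NPS, K7_DUAL_2PROC_NPS, K7_CLOCK_GHZ)
print(round(ghz, 2))  # 1.98
```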
>>>>>
>>>>>If memory is a big issue for Diep, it probably won't scale linearly as memory
>>>>>never does.
>>>>
>>>>It's a bigger issue for crafty than for DIEP. I hope you realize that
>>>>this diep version is from 25 august 2002, that beta version runs pretty ok
>>>>at cc-NUMA machines as well.
>>>>
>>>>Crafty doesn't though.
>>>>
>>>>>>there are motherboards now which do not require registered memory and
>>>>>>the K7 runs already quite a while at 2.0Ghz in fact. Now i don't care
>>>>>>for XP at all here nor do i care for the P4 at all. I just care for
>>>>>>parallel search here.
>>>>>>
>>>>>>If we know that a 2.0Ghz dual K7 is identical to a dual 2.8Ghz Xeon
>>>>>>and that in the majority of cases the K7 is going to win, then considering
>>>>>>the huge price difference, the choice would be trivial for most who
>>>>>>are looking for a lot of computing power for little money.
>>>>>
>>>>>AMD has always been better price/performance. Before the huge price differences
>>>>>in AMD and Intel chips, the AMD chips meant your old Socket 7 board could be
>>>>>used through ~500 MHz.
>>>>
>>>>>>Doesn't take away the fact that the P4 is winning ground. I remember
>>>>>>the first dual AMD 1.2ghz test versus P4 dual 1.7Ghz and the AMD dual
>>>>>>being 20% faster. Meaning in short that the speed of a P4 was performing
>>>>>>about 1 : 1.7
>>>>>>
>>>>>>Now if i compare a dual Xeon 2.8Ghz with a 2Ghz K7 then it's equal
>>>>>>meaning the P4 is performing 1 : 1.4
>>>>>>
>>>>>>So that's a big step forward!
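The two ratios can be worked out as P4 Ghz needed per K7 Ghz for equal speed. This is a sketch of the arithmetic only, with a made-up function name, and it takes the "20% faster" and "equal" statements above at face value.

```python
def p4_ghz_per_k7_ghz(p4_ghz, k7_ghz, k7_speed_advantage=1.0):
    """P4 clock needed per K7 Ghz for equal speed.

    k7_speed_advantage is how much faster the K7 system ran in the
    comparison (1.2 means 20% faster, 1.0 means equal).
    """
    return (p4_ghz / k7_ghz) * k7_speed_advantage


# First test: dual P4 1.7Ghz versus dual AMD 1.2Ghz, AMD 20% faster.
print(round(p4_ghz_per_k7_ghz(1.7, 1.2, 1.2), 2))  # 1.7
# Now: dual Xeon 2.8Ghz roughly equal to a dual 2.0Ghz K7.
print(round(p4_ghz_per_k7_ghz(2.8, 2.0), 2))       # 1.4
```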
>>>>>
>>>>>Well just about every application saw a similar gain going from the 256 KB
>>>>>cache Willamette to the 512 KB cache Northwood. The new Xeons, as I understand,
>>>>>have 1 MB L3 cache in -addition- to the other caches. Don't quote me there. All
>>>>>I know is that things changed. The extra cache makes the P4 competitive whereas
>>>>
>>>>It's the DDR ram that sped DIEP and crafty up a lot. Not the bigger
>>>>cache so much.
>>>>
>>>>DDR ram has nearly 2 times lower latency than RDRAM.
>>>
>>>You seem so sure, but you never tested a Northwood on RDRAM or a Willamette on
>>>DDR SDRAM to know.
>>
>>I tested many P4s with RDRAM and they were very slow.
>>
>>It is trivial that if the latency is 2 times slower, this has
>>a major impact on things like hashtables (assuming the same
>>processor gets put in the machine, because obviously a cpu matters
>>way more than a bit faster ram).
>>
>>It's like creating an obstacle.
>>
>>there is eval hashtable. there is transposition hashtable. there is
>>pawn hashtables. etcetera in diep.
>>
>>that doesn't fit *ever* in L2 cache at all.
>>
>>A big L2 cache matters basically when you start getting parallel.
>>
>>For a single cpu speed the L2 cache matters nothing at all.
>>
>>The initial tests that i did rdram versus ddr ram indicated 1.7 versus 1.5
>>
>>with SMT that's 1.4 now.
>>
>>Best regards,
>>Vincent
>
>L2 matters a lot, just not in your hash probes. A hash probe is subject to
>latency. Not everything is.
>
>It is probably very tempting to make the jump from "HT performance is x% in
>chess engines" to "HT performance is x% in applications." You can't say that.
>Chess engines are not representative of all applications. Neither your memory
>accesses nor access patterns are representative of typical software.
>
>I still haven't seen conclusive data, either. You need to run controlled tests.
>You can't just change a bunch of things and say, "Oh, it was that one that
>caused this effect."
>
>-Matt
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.