Author: Vincent Diepeveen
Date: 09:30:40 12/17/02
On December 17, 2002 at 11:59:48, Matt Taylor wrote:

>On December 17, 2002 at 11:33:36, Vincent Diepeveen wrote:
>
>>On December 17, 2002 at 11:27:18, Matt Taylor wrote:
>>
>>>On December 17, 2002 at 10:10:46, Vincent Diepeveen wrote:
>>>
>>>>Hello,
>>>>
>>>>Some tests were performed in the USA, where some P4 Xeon dual 2.8Ghz
>>>>systems get delivered now. In Europe we can't get them yet and
>>>>most likely we don't want them either:
>>>>
>>>>Here are the results of DIEP at the Xeon 2.8Ghz dual ECC registered DDR ram.
>>>>
>>>>test 1: diep 4 processes. Of course HT enabled.
>>>>        181538 nps
>>>>
>>>>test 2: diep 2 processes. HT enabled.
>>>>        135924 nps
>>>>
>>>>test 3: diep 2 processes K7 1.6ghz (registered DDR ram all other settings
>>>>        identical to xeon dual setup):
>>>>        146555
>>>>
>>>>THE 2 TESTS NALIMOV DIDN'T OR COULDN'T WANT TO DO WITH CRAFTY
>>>>SOME WEEKS AGO REVEAL A BIG WEAKNESS OF HT/SMT:
>>>>
>>>>test 4: diep 2 processes. HT disabled. 171288 nps
>>>>
>>>>test 5 and 6: diep single cpu HT disabled and enabled were same speed
>>>>        92090 nps versus 92019 nps.
>>>
>>>Crafty gets better results with HT, but it's been optimized for HT. It just
>>
>>That hasn't been proven yet.
>>
>>there was no test done without HT and 2 processors as far as i know.
>>
>>Please read how i tested it.
>
>I'm pretty sure he did non-HT tests too.
>
>>>means you need a personal Intel engineer to make it blazing fast for people who
>>>plopped down $600 USD for a top-of-the-line Intel chip. Before long they'll
>>>start selling Intel engineers in local computer shops. Collect all 18...
>>
>>Crafty is doing 2 probes in 2 hashtables for example. Remove it and
>>improve it to 4 probes at 1 table (which is faster on both intel and
>>AMD anyway, but AMD profits more because its chipset is cheaper).
>>
>>>HT is a good idea, and it works in practice rather than just on paper. It just
>>>doesn't work for -everything-.
>>
>>in the factory they press 2 cpu's and put a single P4 sticker on it.
>>You pay a factor 2 more, but get something 11.4% faster. For databases
>>it was measured 11% rather than 11.4%.
>>
>>That's what i call a bad buy!
>
>CPUs since the Pentium have been pipelined. The goal is to spread the work out
>so you can get a throughput of at least 1 op/cycle. Not always possible,

1 op/cycle is very bad. Already when the Pentium Pro 200 existed, the average
C program ran at 1.76 instructions a clock, so 1 op/cycle is very bad by that
standard. Considering that a K7, calculated back to the Pentium Pro 200, is a
hell of a lot faster per processor clock, you will realize that it's not worse
than 1.76 now.

Note that the 1.76 number wasn't measured by me. I have no idea how to measure
this for DIEP, otherwise i would know.

I do see, however, that a processor like McKinley, which can issue more
instructions a clock (6 a clock, in 2 bundles of 3), achieves way faster speeds
for DIEP than any other processor. No, being 64 bits has nothing to do with
that. On the contrary: moving a 64 bits word on the McKinley requires more
power than moving a 32 bits word on a K7.

Diep is a 32 bits program, so if i compile an 'int' and the compiler makes that
a 64 bits instruction for the McKinley, that's not my worry. What makes me happy
is its speed, and it is doing very well: per GHz it is 33% faster than a K7
(with a badly compiled executable from a cross compiler).
1.33 GHz K7 == 1.0 GHz McKinley.

And that's still *without* branch optimization for the McKinley. I don't know
what speedup that will give, but do not forget that i compiled it for the
Itanium1; i didn't optimize for Itanium2 at all. Nalimov probably has more
details regarding this. It is clear to me at least that this Itanium2 is a big
winner.

>particularly with complex instructions. Every CPU since then has adhered to
>superscalar designs.
>The Pentium 4 is no different. It has an extremely long pipeline to enable it to
>clock to higher frequencies.

Exactly what i fear, yes.

>The bulk of this pipeline is shared for each
>"logical" CPU. They share caches, execution units, decoders, etc. The only thing
>that gets duplicated is the register set, a smaller part of the CPU.
>
>>>>First conclusion is that the system is profiting only from HT when you
>>>>use 4 processes at the same time, OTHERWISE IT IS A DISADVANTAGE IF
>>>>YOU MULTITHREAD, because see the big difference between 2 processes
>>>>running with HT turned on and off.
>>>>
>>>>In itself when you have a program with just 2 threads which you
>>>>run on a dual it gets slower. My assumption is that the hardware reports
>>>>4 cpu's and that the software doesn't care at what cpu to schedule
>>>>the processes/threads. the result of that is that there is a 33% chance
>>>>that things get scheduled at a cpu which is already running a thread/process.
>>>>
>>>>Resulting in a system where 1 cpu idles kind of shortly and 1 cpu is running
>>>>2 threads/processes.
>>>>
>>>>Actually the actual chance that the 2 processes are scheduled at
>>>>2 different processors (there are 4 processors for the OS
>>>>times 3 processors left for the second process is 12 different
>>>>schedulings) is: 8/12 = 2/3 = 66%. In short there is a disaster possibility
>>>>of 33%.
>>>
>>>Yes, when one thread is scheduled on one processor, there are 3 choices for the
>>>other thread, and one is disaster. 1/3 = 33%.
>>
>>>>Now the absolute speed from performance viewpoint. If the system idles
>>>>completely and then starts to run *exclusively* diep at 4 processors, then
>>>>the measured speedup as you can calculate is in the order of 11.4% for
>>>>SMT/HT.
>>>>
>>>>That's not so much actually. The loss by searching parallel is at most
>>>>parallel applications bigger than the win of 11.4%. In case of DIEP
>>>>i am on the lucky side and go for that 11.4% faster speed.
>>>>
>>>>Yet the sad confirmation is that the pessimistic expectation about the
>>>>absolute speed is completely confirmed.
>>>>This system performs (assuming linear scaling) like a 1.98 Ghz dual K7.
>>>
>>>If memory is a big issue for Diep, it probably won't scale linearly as memory
>>>never does.
>>
>>It's a bigger issue for crafty than for DIEP. I hope you realize that
>>this diep version is from 25 august 2002, that beta version runs pretty ok
>>at cc-NUMA machines as well.
>>
>>Crafty doesn't though.
>>
>>>>there are motherboards now which do not require registered memory and
>>>>the K7 runs already quite a while at 2.0Ghz in fact. Now i don't care
>>>>for XP at all here nor do i care for the P4 at all. I just care for
>>>>parallel search here.
>>>>
>>>>If we know that a 2.0Ghz dual K7 is identical to a dual 2.8Ghz Xeon
>>>>and that in the majority of cases the K7 is going to win, then considering
>>>>the huge price difference, the choice would be trivial for most who
>>>>are looking for a lot of computing power for little money.
>>>
>>>AMD has always been better price/performance. Before the huge price differences
>>>in AMD and Intel chips, the AMD chips meant your old Socket 7 board could be
>>>used through ~500 MHz.
>>
>>>>Doesn't take away the fact that the P4 is winning ground. I remember
>>>>the first dual AMD 1.2ghz test versus P4 dual 1.7Ghz and the AMD dual
>>>>being 20% faster. Meaning in short that the speed of a P4 was performing
>>>>about 1 : 1.7
>>>>
>>>>Now if i compare a dual Xeon 2.8Ghz with a 2Ghz K7 then it's equal
>>>>meaning the P4 is performing 1 : 1.4
>>>>
>>>>So that's a big step forward!
>>>
>>>Well just about every application saw a similar gain from the 512 KB cache
>>>Northwood from the 256 KB cache Willamette. The new Xeons, as I understand,
>>>have 1 MB L3 cache in -addition- to the other caches. Don't quote me there. All
>>>I know is that things changed. The extra cache makes the P4 competitive whereas
>>
>>It's the DDR ram that speeded DIEP and crafty up a lot. Not the bigger
>>cache so much.
>>
>>DDR ram has nearly 2 times faster latency than RDRAM.
>
>You seem so sure, but you never tested a Northwood on RDRAM or a Willamette on
>DDR SDRAM to know.

I tested many P4s with RDRAM and they were very slow. It is obvious that if the
latency is 2 times slower, this has a major impact on things like hashtables
(assuming the same processor gets put in the machine, because obviously a cpu
matters way more than slightly faster ram). It's like creating an obstacle.

In diep there is an eval hashtable, a transposition hashtable, pawn hashtables,
etcetera. Those don't *ever* fit in L2 cache. A big L2 cache basically matters
when you start going parallel. For single cpu speed the L2 cache doesn't matter
at all.

The initial tests that i did, RDRAM versus DDR ram, indicated 1.7 versus 1.5;
with SMT that's 1.4 now.

Best regards,
Vincent

>>>before P4 performance was something of an oxymoron, a joke among the people
>>>who'd seen its scores, and a disappointment for former Intel fans.
>>
>>>You'll probably observe the trend shift (not -completely-) toward the former
>>>when AMD releases Barton, likewise equipped with 512 KB of L2 cache.
>>
>>512KB is better than 256KB but i do not believe that the changing of just
>>the cache is going to improve the thing a lot. Getting it to 0.13 and
>>also clocking it at 3 Ghz will have more of an impact i bet.
>
>The size of the core doesn't affect performance directly. It affects how high
>the CPU gets clocked. The CPU can only do real work on the edges of a clock
>cycle. It doesn't matter how small it gets; if the CPU receives a 1 Hz clock,
>it's going to go 1 Hz, and that's pretty slow.
>
>In computationally-intensive applications, clock speed will yield linear
>increases in performance. However, you haven't posted results for Diep on a wide
>variety; you've only posted the four benchmarks. Little can be discerned except
>which system is faster.
>That yields no useful information about the architecture
>or how clock rate affects performance or how ram affects performance. There is
>no data.
>
>>>>Whether the step is because of DDR ram versus the very bad performing
>>>>RDRAM (nearly 2 times slower latency) is a matter of open discussion.
>>>>
>>>>HT/SMT in itself is not so impressing now.
>>>>
>>>>It's trivial to say that it will get impressive when the P4 can split itself
>>>>into 2 real processors having little dependencies on each other.
>>>>
>>>>Right now the single cpu win on a P4 3.06Ghz HT (18%) is
>>>>clearly more than the older generation 2.8 Ghz HT/SMT. so it seems
>>>>also this technique is slowly winning in realism.
>>>>
>>>>Right now i can't take what's getting on the market now very serious.
>>>>
>>>>Best regards,
>>>>Vincent
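
For anyone who wants to redo two of the calculations quoted above, here is a minimal sketch in Python. It only uses the nps figures from the post (test 1 and test 3). The function names are made up for illustration, and the model is an assumption taken from the thread: nps scales linearly with clock, and the OS places 2 processes on 2 distinct logical cpu's at random without knowing which ones share a physical cpu.

```python
from fractions import Fraction
from itertools import permutations

# Figures quoted in the post (nodes per second).
XEON_DUAL_HT_4PROC_NPS = 181538  # test 1: dual Xeon 2.8 GHz, HT on, 4 processes
K7_DUAL_2PROC_NPS = 146555       # test 3: dual K7 1.6 GHz, 2 processes
K7_CLOCK_GHZ = 1.6


def k7_equivalent_clock_ghz(xeon_nps, k7_nps, k7_clock_ghz):
    """Clock a dual K7 would need to match xeon_nps.

    Assumption: nps scales linearly with clock speed.
    """
    return k7_clock_ghz * xeon_nps / k7_nps


def same_package_probability(packages=2, logical_per_package=2):
    """Chance that 2 processes end up on the same physical cpu.

    Assumption: every logical cpu looks the same to the scheduler, and the
    2 processes go to 2 distinct logical cpu's chosen uniformly at random.
    """
    logical_cpus = [(p, s) for p in range(packages)
                    for s in range(logical_per_package)]
    # Ordered placements of 2 processes: 4 * 3 = 12 for a dual with HT.
    placements = list(permutations(logical_cpus, 2))
    clashes = sum(1 for a, b in placements if a[0] == b[0])
    return Fraction(clashes, len(placements))


if __name__ == "__main__":
    ghz = k7_equivalent_clock_ghz(XEON_DUAL_HT_4PROC_NPS,
                                  K7_DUAL_2PROC_NPS, K7_CLOCK_GHZ)
    print(f"dual K7 equivalent clock: {ghz:.2f} GHz")
    print(f"chance both processes share a physical cpu: {same_package_probability()}")
```

Under those assumptions the first number comes out at the 1.98 GHz mentioned in the post, and the second at the 1/3 that both posters arrive at. A scheduler that knows about HT would of course not place processes at random, so the 1/3 only holds for the naive case.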
Last modified: Thu, 15 Apr 21 08:11:13 -0700