Author: Vincent Diepeveen
Date: 17:23:40 12/17/02
I remember Bob comparing 1MB L2 cache Xeons with 2MB L2 cache Xeons a few years ago and concluding it mattered 0% for him.

Now I won't say it matters 0% for DIEP to go from 1MB to 2MB, but on the R14000 I see no difference for DIEP between 8MB and 2MB L2 cache either. Nearly no difference.

Best regards,
Vincent

On December 17, 2002 at 13:14:56, Matt Taylor wrote:

>On December 17, 2002 at 12:30:40, Vincent Diepeveen wrote:
>
>>On December 17, 2002 at 11:59:48, Matt Taylor wrote:
>>
>>>On December 17, 2002 at 11:33:36, Vincent Diepeveen wrote:
>>>
>>>>On December 17, 2002 at 11:27:18, Matt Taylor wrote:
>>>>
>>>>>On December 17, 2002 at 10:10:46, Vincent Diepeveen wrote:
>>>>>
>>>>>>Hello,
>>>>>>
>>>>>>Some tests were performed in the USA, where some P4 Xeon dual 2.8Ghz
>>>>>>systems get delivered now. In Europe we can't get them yet and
>>>>>>most likely we don't want them either:
>>>>>>
>>>>>>Here are the results of DIEP at the Xeon 2.8Ghz dual ECC registered DDR ram.
>>>>>>
>>>>>>test 1: diep 4 processes. Of course HT enabled.
>>>>>> 181538 nps
>>>>>>
>>>>>>test 2: diep 2 processes. HT enabled.
>>>>>> 135924 nps
>>>>>>
>>>>>>test 3: diep 2 processes K7 1.6ghz (registered DDR ram all other settings
>>>>>> identical to xeon dual setup):
>>>>>> 146555
>>>>>>
>>>>>>THE 2 TESTS NALIMOV DIDN'T OR COULDN'T WANT TO DO WITH CRAFTY
>>>>>>SOME WEEKS AGO REVEAL A BIG WEAKNESS OF HT/SMT:
>>>>>>
>>>>>>test 4: diep 2 processes. HT disabled. 171288 nps
>>>>>>
>>>>>>test 5 and 6: diep single cpu HT disabled and enabled were same speed
>>>>>> 92090 nps versus 92019 nps.
>>>>>
>>>>>Crafty gets better results with HT, but it's been optimized for HT. It just
>>>>
>>>>That hasn't been proven yet.
>>>>
>>>>there was no test done without HT and 2 processors as far as i know.
>>>>
>>>>Please read how i tested it.
>>>
>>>I'm pretty sure he did non-HT tests too.
>>>
>>>>>means you need a personal Intel engineer to make it blazing fast for people who
>>>>>plopped down $600 USD for a top-of-the-line Intel chip. Before long they'll
>>>>>start selling Intel engineers in local computer shops. Collect all 18...
>>>>
>>>>Crafty is doing 2 probes in 2 hashtables for example. Remove it and
>>>>improve it to 4 probes at 1 table (which is faster on both intel and
>>>>AMD anyway, but AMD profits more because its chipset is cheaper).
>>>>
>>>>>HT is a good idea, and it works in practice rather than just on paper. It just
>>>>>doesn't work for -everything-.
>>>>
>>>>in the factory they press 2 cpu's and put a single P4 sticker on it.
>>>>You pay a factor 2 more, but get something 11.4% faster. For databases
>>>>it was measured 11% rather than 11.4%.
>>>>
>>>>That's what i call a bad buy!
>>>
>>>CPUs since the Pentium have been pipelined. The goal is to spread the work out
>>>so you can get a throughput of at least 1 op/cycle. Not always possible,
>>
>>1 op/cycle is very bad.
>>
>>Already when the pentiumpro 200 existed the average C program ran at
>>1.76 instructions a clock.
>>
>>So 1 iop/cycle is very bad then.
>>
>>Considering that a K7 when calculated back to the pentiumpro200 is hell
>>of a lot faster for each processor clock, then you will realize clearly
>>that it's not worse than 1.76 now.
>>
>>Note that the 1.76 number wasn't measured by me. I have no idea how i
>>measure this for DIEP otherwise i would know.
>
>The Pentium Pro executed an average of 1.76 micro-ops/cycle. There is a big
>difference between an average speed and an instantaneous speed. If I travel 10
>km/h for the duration of 1 hour and then remain still for 9 hours, I had an
>average speed of 1 km/h. I think the difference here is obvious enough that it
>warrants no further discussion.
>
>>I see however that processors like McKinley which can do more instructions
>>a clock (6 a bundle) that it achieves way faster speeds for DIEP than
>>any other processor.
>>No, being 64 bits has nothing to do with that.
>
>Itanium is designed for high IPC. They don't have to ramp the clock speed to
>make it fast. It is silly to criticize the Pentium 4 for low IPC when it is
>-designed- for it. It's like criticizing a Geo Metro for being slow. Most people
>will scratch their heads for a second, look at you funny, and then say, "Duh?"
>
>In the case of Pentium 4, it seems paradoxical to achieve greater speeds with
>lower IPC, but it is the strategy Intel picked. It is useless to criticize
>further without actual facts (such as the fact that Pentium 4 is intended to hit
>5 GHz).
>
>>In contradiction to move a 64 bits word at the McKinley requires more
>>power than moving a 32 bits word at a K7.
>>
>>Diep is a 32 bits program so if i compile 'int' then if the compiler makes
>>that a 64 bits instruction for the mckinley that's not my worry.
>>
>>My happy feeling is the speed of it and it is doing very well. it is 33%
>>faster (a bad compiled executable with a cross compiler) for each Ghz of
>>a K7. 1.33Ghz K7 == 1.0 Ghz McKinley.
>
>That's not very good. The 1 GHz Itanium is the upper-end Itanium right now. Your
>1.33 GHz Athlon was slower than a 2.4 GHz Pentium 4 (which was equal to 1.6 GHz
>Athlon you said), and an 800 MHz Itanium has performed as well as a 3 GHz
>Pentium 4 for other people.
>
>>Now that's *without* branch optimization yet for the McKinley. I don't know
>>what speedup that will get for the mckinley but i compiled it for the
>>itanium1 do not forget that. i didn't optimize for itanium2 at all.
>
>Itanium 2 isn't out yet. McKinley is still Itanium 1. It's like the difference
>between Thunderbird and AthlonXP.
>
>>Nalimov probably has more details regarding this.
>>
>>It is clear to me at least that this itanium2 is a big winner.
>
>I would hope so. Ever since I read about the IA-64 architecture it was obvious
>to me that Intel engineers put tremendous thought into its design.
>
>>>particularly with complex instructions.
>>>Every CPU since then has adhered to
>>>superscalar designs.
>>
>>>The Pentium 4 is no different. It has an extremely long pipeline to enable it to
>>>clock to higher frequencies.
>>
>>Exactly what i fear yes.
>
>Why fear it? It's a natural progression. Branch prediction was -buggy- on a
>Pentium, but it didn't matter because a mispredict didn't carry a huge penalty
>with it. The Pentium can actually mispredict instructions that DON'T BRANCH.
>
>Next was the P6 core, the PPro/Pentium 2/Pentium 3. Deeper pipeline, but they
>had reordering and other stuff. Criticized for a deep pipeline. Athlon was
>deeper than the K6. Now they release the Pentium 4 and Intel gets scathed for a
>deep pipeline yet again. Why does it matter? They're still fast in most code
>because most code doesn't branch mispredict.
>
>I fixed some of my branches simply by changing the direction they jumped. A lot
>of it is as simple as that. If I mispredict once in a loop of 1,000 iterations
>(each branching), what's the difference?
>
>>> The bulk of this pipeline is shared for each
>>>"logical" CPU. They share caches, execution units, decoders, etc. The only thing
>>>that gets duplicated is the register set, a smaller part of the CPU.
>>>
>>>>>>First conclusion is that the system is profiting only from HT when you
>>>>>>use 4 processes at the same time, OTHERWISE IT IS A DISADVANTAGE IF
>>>>>>YOU MULTITHREAD, because see the big difference between 2 processes
>>>>>>running with HT turned on and off.
>>>>>>
>>>>>>In itself when you have a program with just 2 threads which you
>>>>>>run on a dual it gets slower. My assumption is that the hardware reports
>>>>>>4 cpu's and that the software doesn't care at what cpu to schedule
>>>>>>the processes/threads. the result of that is that there is a 33% chance
>>>>>>that things get scheduled at a cpu which is already running a thread/process.
>>>>>>
>>>>>>Resulting in a system where 1 cpu idles kind of shortly and 1 cpu is running
>>>>>>2 threads/processes.
>>>>>>
>>>>>>Actually the actual chance that the 2 processes are scheduled at
>>>>>>2 different processors (there is 4 processors for the OS
>>>>>>times 3 processors left for the second process is 12 different
>>>>>>schedulings) is: 8/12 = 2/3 = 66%. In short there is a disaster possibility
>>>>>>of 33%.
>>>>>
>>>>>Yes, when one thread is scheduled on one processor, there are 3 choices for the
>>>>>other thread, and one is disaster. 1/3 = 33%.
>>>>
>>>>>>Now the absolute speed from performance viewpoint. If the system idles
>>>>>>completely and then starts to run *exclusively* diep at 4 processors, then
>>>>>>the measured speedup as you can calculate is in the order of 11.4% for
>>>>>>SMT/HT.
>>>>>>
>>>>>>That's not so much actually. The loss by searching parallel is at most
>>>>>>parallel applications bigger than the win of 11.4%. In case of DIEP
>>>>>>i am on the lucky side and go for that 11.4% faster speed.
>>>>>>
>>>>>>Yet the sad confirmation is that the pessimistic expectation about the
>>>>>>absolute speed is completely confirmed. This system performs (assuming
>>>>>>linear scaling) like a 1.98 Ghz dual K7.
>>>>>
>>>>>If memory is a big issue for Diep, it probably won't scale linearly as memory
>>>>>never does.
>>>>
>>>>It's a bigger issue for crafty than for DIEP. I hope you realize that
>>>>this diep version is from 25 august 2002, that beta version runs pretty ok
>>>>at cc-NUMA machines as well.
>>>>
>>>>Crafty doesn't though.
>>>>
>>>>>>there are motherboards now which do not require registered memory and
>>>>>>the K7 runs already quite a while at 2.0Ghz in fact. Now i don't care
>>>>>>for XP at all here nor do i care for the P4 at all. I just care for
>>>>>>parallel search here.
>>>>>>
>>>>>>If we know that a 2.0Ghz dual K7 is identical to a dual 2.8Ghz Xeon
>>>>>>and that in the majority of cases the K7 is going to win, then considering
>>>>>>the huge price difference, the choice would be trivial for most who
>>>>>>are looking for a lot of computing power for little money.
>>>>>
>>>>>AMD has always been better price/performance. Before the huge price differences
>>>>>in AMD and Intel chips, the AMD chips meant your old Socket 7 board could be
>>>>>used through ~500 MHz.
>>>>
>>>>>>Doesn't take away the fact that the P4 is winning ground. I remember
>>>>>>the first dual AMD 1.2ghz test versus P4 dual 1.7Ghz and the AMD dual
>>>>>>being 20% faster. Meaning in short that the speed of a P4 was performing
>>>>>>about 1 : 1.7
>>>>>>
>>>>>>Now if i compare a dual Xeon 2.8Ghz with a 2Ghz K7 then it's equal
>>>>>>meaning the P4 is performing 1 : 1.4
>>>>>>
>>>>>>So that's a big step forward!
>>>>>
>>>>>Well just about every application saw a similar gain from the 512 KB cache
>>>>>Northwood from the 256 KB cache Willamette. The new Xeons, as I understand,
>>>>>have 1 MB L3 cache in -addition- to the other caches. Don't quote me there. All
>>>>>I know is that things changed. The extra cache makes the P4 competitive whereas
>>>>
>>>>It's the DDR ram that speeded DIEP and crafty up a lot. Not the bigger
>>>>cache so much.
>>>>
>>>>DDR ram has nearly 2 times faster latency than RDRAM.
>>>
>>>You seem so sure, but you never tested a Northwood on RDRAM or a Willamette on
>>>DDR SDRAM to know.
>>
>>I tested many P4s with RDRAM and they were very slow.
>>
>>It is trivial that if the latency is 2 times slower that this has
>>a major impact onto things like hashtables (assuming the same
>>processor gets put in the machine, because obviously a cpu matters
>>way more than a bit faster ram).
>>
>>It's like creating an obstacle.
>>
>>there is eval hashtable. there is transposition hashtable. there is
>>pawn hashtables. etcetera in diep.
>>
>>that doesn't fit *ever* in L2 cache at all.
>>
>>A big L2 cache matters basically when you start getting parallel.
>>
>>For a single cpu speed the L2 cache matters nothing at all.
>>
>>The initial tests that i did rdram versus ddr ram indicated 1.7 versus 1.5
>>
>>with SMT that's 1.4 now.
>>
>>Best regards,
>>Vincent
>
>L2 matters a lot, just not in your hash probes. A hash probe is subject to
>latency. Not everything is.
>
>It is probably very tempting to make the jump from "HT performance is x% in
>chess engines" to "HT performance is x% in applications." You can't say that.
>Chess engines are not representative of all applications. Neither your memory
>accesses nor access patterns are representative of typical software.
>
>I still haven't seen conclusive data, either. You need to run controlled tests.
>You can't just change a bunch of things and say, "Oh, it was that one that
>caused this effect."
>
>-Matt
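
For anyone who wants to check two of the figures argued over in the quoted thread, here is a minimal Python sketch. It only reuses the node rates posted for test 1 and test 3 and rests on the thread's own assumptions: linear scaling with clock speed, and a scheduler that places two processes uniformly at random on four logical CPUs. The function names are made up for illustration and are not part of DIEP or Crafty.

```python
from fractions import Fraction
from itertools import permutations

# Node rates (nodes per second) as posted in the thread.
NPS_XEON_4PROC_HT = 181538  # test 1: dual Xeon 2.8 GHz, 4 processes, HT enabled
NPS_K7_DUAL_1600 = 146555   # test 3: dual K7 1.6 GHz, 2 processes


def k7_equivalent_ghz(xeon_nps, k7_nps, k7_ghz):
    """Clock a dual K7 would need to match xeon_nps, ASSUMING linear scaling."""
    return k7_ghz * xeon_nps / k7_nps


def collision_probability(physical_cpus=2, logical_per_cpu=2):
    """Chance that two processes share a physical CPU, ASSUMING each is placed
    uniformly at random on a distinct logical CPU."""
    logical = [(p, s) for p in range(physical_cpus) for s in range(logical_per_cpu)]
    placements = list(permutations(logical, 2))  # ordered pairs: 4 * 3 = 12
    shared = [pl for pl in placements if pl[0][0] == pl[1][0]]
    return Fraction(len(shared), len(placements))


if __name__ == "__main__":
    print(round(k7_equivalent_ghz(NPS_XEON_4PROC_HT, NPS_K7_DUAL_1600, 1.6), 2))
    print(collision_probability())
```

Under those assumptions the first value reproduces the "1.98 Ghz dual K7" figure and the second the 1/3 "disaster" chance. Neither says anything about how a real scheduler behaves, which is the kind of thing a controlled test would have to show.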
Last modified: Thu, 15 Apr 21 08:11:13 -0700