Author: Robert Hyatt
Date: 10:03:02 11/03/02
On November 03, 2002 at 10:07:07, Vincent Diepeveen wrote:

>On November 02, 2002 at 00:10:17, Robert Hyatt wrote:
>
>At the P4 with 1 decoder, 12K i cache and just 8KB data cache
>i could measure no speedup. Only slow downs if i tried to run
>too many threads.
>
>Your claims with crafty proofs it fits within the trace cache somehow.
>
>Also i tested at SINGLE CPU P4s and there i could measure no speedup
>at all. Only disasters. I will test crafty on the things too.

I don't know how you test. I have a dual PIV/2.2GHz here. mt=4 is faster than
mt=2. Not anywhere near twice as fast, but some testing I ran when we first got
this machine produced a 30% improvement on one processor going from mt=1 to
mt=2 (we added the second processor after the fact, so I got to test with just
one at first). Eugene's data seems to match up with mine.

I'm not sure you understand what the "instruction trace cache" is used for, as
you seem to mention it repeatedly without giving any explanation about why it
is a problem. Ditto for the register renaming stuff...

In any case, there is _no way_ Crafty fits in even a 16kb L1 cache, so I won't
comment further on that point...

>
>>On November 01, 2002 at 17:26:39, Eugene Nalimov wrote:
>>
>>>Vincent,
>>>
>>>I am explaining it to you the 3rd time: I can run tests on those systems, but I
>>>have no physical access to them, so I cannot turn something in BIOS on or off.
>>>That's why I compared results of 2.4GHz system with hyperthreading on and 2.8GHz
>>>one with hyperthreading off -- to show that results are the same if you'll take
>>>into account the speed difference.
>>>
>>>Net result: you can look at the numbers I posted, and you will definitely see
>>>that hyperthreading gives current Crafty, without any hyperthread-related
>>>modifications, double-digit improvement.
>>>
>>>Thanks,
>>>Eugene
>>
>>
>>You are wasting your time. He has made up his mind, declared hyper-threading
>>worthless, and that is that.
>>
>>>
>>>On November 01, 2002 at 17:07:10, Vincent Diepeveen wrote:
>>>
>>>>On November 01, 2002 at 14:55:50, Eugene Nalimov wrote:
>>>>
>>>>So you have a P4 2.8 then to your avail.
>>>>
>>>>Can you post the results of that P4 2.8 single cpu for the next 4
>>>>results:
>>>>
>>>>first:
>>>>  P4 2.8 SMT in bios off and
>>>>    a) MT 1
>>>>    b) MT 2
>>>>
>>>>secondly:
>>>>  P4 2.8 SMT in bios on and
>>>>    a) MT 1
>>>>    b) MT 2
>>>>
>>>>Thanks in advance,
>>>>Vincent
>>>>
>>>>>Once again: the system I run that test on is located in other building. I don't
>>>>>want to bother the friend with rebooting/changing settings/etc. I run the test
>>>>>on a 2.8GHz P4 with hyperthreading turned off, and got 50 seconds at 1,113knps.
>>>>>50*(2.8/2.4) == 58, so 57 seconds looks about right. (I think it is slightly
>>>>>slower than estimate because memory on 2.4GHz system is slower than on 2.8GHz
>>>>>one).
>>>>>
>>>>>I run the same executable on AMD/2000. It took 56 seconds at 994knps to run the
>>>>>test, so 57 seconds at 976knps again looks right.
>>>>>
>>>>>Thanks,
>>>>>Eugene
>>>>>
>>>>>On November 01, 2002 at 14:35:21, Vincent Diepeveen wrote:
>>>>>
>>>>>>On November 01, 2002 at 13:58:06, Eugene Nalimov wrote:
>>>>>>
>>>>>>>I cannot produce the test you are demanding, as I don't have physical access to
>>>>>>>the system on which I run the test, but here are my results.
>>>>>>>
>>>>>>>Dual P4/2.4GHz, hyperthreading turned on, Windows XP Professional.
>>>>>>>Unmodified Crafty 19.0 (i.e. with "bad" spinlock loop).
>>>>>>>"Bench" results (executable restarted after each test).
>>>>>>
>>>>>>also do the tests with SMT disabled in bios,
>>>>>>it should produce the same results as in MT 1 and MT 2.
>>>>>>If not then something different is wrong. In MT 4 it should
>>>>>>produce something real bad there.
>>>>>>
>>>>>>Amazing that with 976 MT 1 you need only 57 seconds to finish the
>>>>>>test.
>>>>>>Single cpu AMD i need (but of course a bit older crafty version):
>>>>>>
>>>>>>White(1): hash 400MB
>>>>>>hash table memory = 384M bytes.
>>>>>>White(1): hashp 16MB
>>>>>>pawn hash table memory = 10M bytes.
>>>>>>White(1): bench
>>>>>>Running benchmark. . .
>>>>>>......
>>>>>>Total nodes: 92683962
>>>>>>Raw nodes per second: 827535
>>>>>>Total elapsed time: 112
>>>>>>SMP time-to-ply measurement: 5.714286
>>>>>>White(1): quit
>>>>>>execution complete.
>>>>>>
>>>>>>Or in short 112 seconds (visual c++ 6.0 sp4 proc pack default compile)
>>>>>>and 827 K nps.
>>>>>>
>>>>>>You need millions of nodes less?
>>>>>>
>>>>>>>mt=1: 976knps, 57 seconds
>>>>>>>mt=2: 1,705knps, 38 seconds
>>>>>>>mt=4: 2,006knps, 35 seconds
>>>>>>>
>>>>>>>I.e. there is not only ~17% raw nps speedup, but *absolute time* is also ~8%
>>>>>>>smaller.
>>>>>>>
>>>>>>>And that is for the executable that is non-hyperthread aware, i.e. contains bad
>>>>>>>spinlock loop.
>>>>>>>
>>>>>>>I tested exactly the executable that is on Bob's FTP site. You can download it
>>>>>>>yourself.
>>>>>>>
>>>>>>>Thanks,
>>>>>>>Eugene
>>>>>>>
>>>>>>>On November 01, 2002 at 13:06:53, Vincent Diepeveen wrote:
>>>>>>>
>>>>>>>>On November 01, 2002 at 12:20:14, Robert Hyatt wrote:
>>>>>>>>
>>>>>>>>Feel free to ship a version of crafty that doesn't do spinlock
>>>>>>>>or whatever you want to modify. I'll extensively test it for you
>>>>>>>>at all P4s i can get my hands on...
>>>>>>>>
>>>>>>>>I would be really amazed if you get even 0.1% faster in nodes a
>>>>>>>>second...
>>>>>>>>
>>>>>>>>...of course it must be a fair compare in contradiction to what
>>>>>>>>intel shows.
>>>>>>>>They do next comparison
>>>>>>>>
>>>>>>>> a) some feature called 'SMT' in the bios turned on
>>>>>>>>    - just running 2 threads then
>>>>>>>> b) turning it off
>>>>>>>>    - also running 2 threads at it
>>>>>>>>
>>>>>>>>Like everyone who is not so naive we know that you also need
>>>>>>>>to do next test:
>>>>>>>>
>>>>>>>> a) some feature called 'SMT' in the bios turned on
>>>>>>>>    - just running 1 thread eating all system time
>>>>>>>> b) turning it off
>>>>>>>>    - also running 1 thread eating all system time
>>>>>>>>
>>>>>>>>There shouldn't be a speed difference between a and b of course.
>>>>>>>>
>>>>>>>>That verification step is missing.
>>>>>>>>
>>>>>>>>>On November 01, 2002 at 11:56:56, Vincent Diepeveen wrote:
>>>>>>>>>
>>>>>>>>>>On November 01, 2002 at 10:41:25, Robert Hyatt wrote:
>>>>>>>>>>
>>>>>>>>>>>On October 31, 2002 at 10:53:07, Vincent Diepeveen wrote:
>>>>>>>>>>>
>>>>>>>>>>>>On October 30, 2002 at 06:59:21, Terje Vagle wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>>The new cpu from intel will have a new function called
>>>>>>>>>>>>>hyper-threading.
>>>>>>>>>>>>>
>>>>>>>>>>>>>This will make the operating system able to recognize the cpu as if it was
>>>>>>>>>>>>>2 cpu's.
>>>>>>>>>>>>>
>>>>>>>>>>>>>Could the programs with smp-support make use of this?
>>>>>>>>>>>>>
>>>>>>>>>>>>>Regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>>Terje Vagle
>>>>>>>>>>>>
>>>>>>>>>>>>No, chessprograms cannot make use of that feature at all. It is sad but
>>>>>>>>>>>>the truth. Hyperthreading is a cool thing for the future but the P4
>>>>>>>>>>>>processor is a too small processor to allow hyperthreading from getting
>>>>>>>>>>>>to work.
>>>>>>>>>>>>
>>>>>>>>>>>>Apart from that a major problem is that even if we have a great processor
>>>>>>>>>>>>which really allows hyperthreading to be effective, that the threads
>>>>>>>>>>>>run at unequal speeds.
>>>>>>>>>>>>
>>>>>>>>>>>>Hyper threading is supposed to work for 2 threads where 1 is a fast
>>>>>>>>>>>>thread and the other is some kind of background thread eating little cpu
>>>>>>>>>>>>time.
>>>>>>>>>>>>
>>>>>>>>>>>>In chessprograms having a second search thread which just runs now and
>>>>>>>>>>>>then in the background is simply impossible to use.
>>>>>>>>>>>
>>>>>>>>>>>It is not impossible at all. The only problem was spinlocks and Eugene
>>>>>>>>>>>posted a link to an Intel document that describes how to solve this problem.
>>>>>>>>>>>
>>>>>>>>>>>Given that solution, hyper-threading will work just fine since spinlocks
>>>>>>>>>>>won't confuse the processor...
>>>>>>>>>>>
>>>>>>>>>>>It won't be 2x faster, but it will certainly be faster if you can run a second
>>>>>>>>>>>thread while the first is blocked on a memory access...
>>>>>>>>>>
>>>>>>>>>>No it won't be 2 times faster. suppose you start crafty with 2 threads.
>>>>>>>>>
>>>>>>>>>I didn't say it would be _two_ times faster.
>>>>>>>>>
>>>>>>>>>I said it would be _faster_.
>>>>>>>>>
>>>>>>>>>And it will.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>thread A starts search and has 1.e4,e5
>>>>>>>>>>thread B starts and continues with 1.d4
>>>>>>>>>>
>>>>>>>>>>now when A is ready, B will still be busy with its own search space,
>>>>>>>>>>and delay thread A time and again.
>>>>>>>>>>
>>>>>>>>>>that'll slow down incredible.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>Except that isn't how it works. The threads co-execute in an intermingled
>>>>>>>>>way as one blocks for a memory read the other fills in the gap. It is
>>>>>>>>>something like having 1.5 cpus... and it does work.
>>>>>>>>>
>>>>>>>>>>You'll be a lot slower than searching with a single thread!
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>Not very likely...
>>>>>>>>>
>>>>>>>>>>Also note that there is just 8 KB data cache and just like
>>>>>>>>>>40 registers to rename variables.
>>>>>>>>>>then another 12KB tracecache.
>>>>>>>>>>
>>>>>>>>>>*both* threads are eating from that 8 KB and 12KB tracecache,
>>>>>>>>>>that is an additional problem they 'overlook'.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>That is a problem on an SMP machine. But _both_ threads are executing
>>>>>>>>>the _same_ code anyway... so that isn't a problem. At least for me.
>>>>>>>>>
>>>>>>>>>For you it is different because you are not using "shared everything" in
>>>>>>>>>lightweight threads, so your results might be different. But all my threads
>>>>>>>>>share the exact same executable instruction code...
>>>>>>>>>
>>>>>>>>>>As you can see from graphs. Usually SMT brings zero speedup.
>>>>>>>>>
>>>>>>>>>I have seen numbers around 1.3 up to 1.5... which is not to be
>>>>>>>>>ignored.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>Try crafty on a 2.4Ghz single cpu P4 or P4-Xeon please (northwood) or
>>>>>>>>>>above. Not on a slower P4 or P4-Xeon. Of course we go for the latest
>>>>>>>>>>hardware...
>>>>>>>>>
>>>>>>>>>Why does it matter? Hyper-Threading is Hyper-Threading, unless you are
>>>>>>>>>going to start that memory speed nonsense. And, in fact, the faster the
>>>>>>>>>processor vs memory speed, the better hyperthreading should perform. Just
>>>>>>>>>like the greater the difference in processor speed vs disk speed, the better
>>>>>>>>>normal operating systems do at running multiple processes.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>Just try it like i tried at Jan Louwman's 2.4Ghz P4s and 2.53Ghz P4s.
>>>>>>>>>
>>>>>>>>>That says it all. "Like I tried it". As if that is a comprehensive and
>>>>>>>>>exhaustive testing?
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>I can't measure *any* speedup *anyhow*.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>Why am I not surprised???
>>>>>>>>>
>>>>>>>>>>Also theoretically i see major problems for the P4 chip even if you
>>>>>>>>>>have software which could theoretically profit.
>>>>>>>>>
>>>>>>>>>"theoretically".
>>>>>>>>>
>>>>>>>>>:)
>>>>>>>>>
>>>>>>>>>:)
>>>>>>>>>
>>>>>>>>>:)
>>>>>>>>>
>>>>>>>>>Theory from someone that doesn't know theory.
>>>>>>>>>
>>>>>>>>>:)
>>>>>>>>>
>>>>>>>>>:)
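
For anyone who wants to recheck the percentages argued over in this thread, here is a small sketch of the arithmetic. It is not from any of the original posts: the helper names pct_gain and pct_saved are made up for illustration, and the only inputs are the bench figures Eugene posted (dual P4/2.4GHz with hyperthreading on, plus the 50-second run on the 2.8GHz P4 with hyperthreading off). It assumes nothing about Crafty itself.

```python
# Bench figures as quoted in the thread: mt -> (knps, seconds).
# Dual P4/2.4GHz, hyperthreading on, unmodified Crafty 19.0.
results = {1: (976, 57), 2: (1705, 38), 4: (2006, 35)}


def pct_gain(new, old):
    """Percentage increase of new over old (used for nodes per second)."""
    return (new / old - 1.0) * 100.0


def pct_saved(new, old):
    """Percentage reduction of new relative to old (used for elapsed time)."""
    return (1.0 - new / old) * 100.0


# mt=4 versus mt=2 on two physical CPUs: the gain attributed to hyperthreading.
nps_gain = pct_gain(results[4][0], results[2][0])
time_saved = pct_saved(results[4][1], results[2][1])

# Clock scaling check: 50 s measured at 2.8GHz, scaled to a 2.4GHz clock.
# This assumes run time scales linearly with clock speed, which ignores memory.
scaled_seconds = 50 * (2.8 / 2.4)

print(f"nps gain mt=2 -> mt=4:   {nps_gain:.1f}%")
print(f"time saved mt=2 -> mt=4: {time_saved:.1f}%")
print(f"50 s at 2.8GHz scaled to 2.4GHz: {scaled_seconds:.1f} s")
```

The results land close to the "~17%" and "~8%" quoted above, and the scaled time lands close to the 58 seconds in the 50*(2.8/2.4) estimate. The linear clock scaling is only a rough model; the thread itself notes that memory speed differs between the two machines.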
Last modified: Thu, 15 Apr 21 08:11:13 -0700