Author: Vincent Diepeveen
Date: 07:37:02 03/24/02
On March 23, 2002 at 16:09:55, Tom Kerrigan wrote:

>On March 23, 2002 at 13:38:45, Dan Andersson wrote:
>
>>SMT is of little or no consequence to chess programs. It might even slow it
>>down. You don't think it automatically doubles the amount of functional units on
>>a given CPU, do you?
>
>You're completely missing the point. SMT was invented and implemented because
>most of a chip's functional units are idle at any given point in time--using
>them for another thread gives you free performance.
>
>I haven't seen any benchmarks yet, but a quad P4 Xeon will appear to software as
>an 8-way system and while it will probably not be as fast as a full-on 8-way
>system, it will be much faster than a 4-thread system.

Complete nonsense. A single P4 can't even outgun a single-CPU MP K7, not to mention a dual. A dual P4 Xeon 1.7GHz is 20% slower than a dual MP K7 1.2GHz for DIEP.

With 3 instructions a clock, a 12KB instruction trace cache and 1024 words of L1 data cache, it is of course insane to run another process on a P4 processor at the same time.

SMT as a whole is interesting for the future, but complete nonsense for the P4. It is just marketing hype. For sure no single P4 can ever profit from it.

Note that 'thread' gets confused with 'process' here too. Processes can execute something directly; threads are all *forced* to do things indirectly, using extra registers for indirection, so for me there is already a clear speed difference between the two. Let's skip that difference however; it is not important for the SMT discussion.

Another thing: how do you *efficiently* use the idle time that is left in a processor? Let's skip the OS problem for now and focus on how to get a speedup out of it with a chess program. No one seems to have thought about that so far, simply because the majority of chess programmers are not searching in parallel; otherwise they would already have realized that it is impossible to give a search thread some execution time only 'now and then'.
Suppose that on a 2-processor Xeon system (forget 4-processor P4 boxes for now) I run 4 search processes. Let's assume that I bind processes A1 and A2 to processor A and that processes B1 and B2 run on processor B.

Whatever the 'idle' time on processor A, process A1 is simply executing way, way faster than A2. Also, A2 is completely thrashing the 1024-word L1 data cache and the small 12KB uop trace cache. So we already start directly at a loss.

Process A1 is sometimes LOCKING, and so is process A2. This is a crucial thing non-SMP programmers always tend to forget.

Also, I have no idea what takes the most processor time for my program, but I assume it is the decoding of integer instructions and branches and, most important, the limit of 3 instructions a clock. It is of course impossible for process A2 to 'jump in' here and do crucial integer decoding and execution while A1 is doing that. If A1 keeps it busy all the time, A2 can hardly hop in. A2 can only do that when A1 is idling here. Realistically speaking, A1 is busy doing *only* this. So A2 will get very little processor time. The result is that you end up with a process A2 that is 100 times slower than A1.

In short, if it is locking something then A1 is continuously busy with integer eXCHanging (so not idling) and will do no work for a certain period of time.

An important problem now is that if the processor sees both processes as 'identical', A1 and A2 will now and then switch roles. Let's assume that A1 and A2 switch roles. Then the total speed of both processors slows down *significantly*.

In short, SMT is only possible if a processor is potentially so fast that the only thing keeping it from executing the process faster is the sequential dependencies in the execution path (branches and such). A first condition is obviously that the processor can do so many instructions a clock, of which more than 50% go unused, that a second process can hop in and get near the speed of the other process.
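
The slot argument above can be put into a rough toy model. This is only a sketch under assumptions that are not in the post: the function name `smt_slots` is made up, A1 is assumed to be served first every clock, A2 only gets the issue slots A1 leaves unused, and the figure of 270 wanted slots per 100 clocks is a hypothetical workload, not a measurement of DIEP or of any real P4.

```python
def smt_slots(issue_width, clocks, a1_wanted, a2_wanted):
    """Toy model: split issue slots between two processes on one SMT core.

    Assumption (not from the post): A1 is served first and A2 only
    gets the slots that A1 leaves unused.
    """
    total = issue_width * clocks          # slots available in this window
    a1 = min(a1_wanted, total)            # A1 takes what it can use
    a2 = min(a2_wanted, total - a1)       # A2 gets only the leftover
    return a1, a2


# Hypothetical workload: each process could use 270 slots per 100 clocks.
for width in (3, 6):
    a1, a2 = smt_slots(width, 100, 270, 270)
    print(f"width {width}: A1={a1} A2={a2} slots per 100 clocks")
```

In this model a width of 3 leaves A2 with 30 slots against 270 for A1, while a width of 6 lets both processes run at the speed they ask for. Only when more than half of the slots go unused can the second process get near the speed of the first.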
That isn't the case with the P4 at all. Going from 3 to 4 instructions a clock will still not solve the problem. The P4 has a hardware limit of 3 instructions a clock. The K7 too, I guess. If we get processors which can do 6 or more instructions a clock and have huge on-chip caches (megabytes) which are very fast, then we can think about SMT.

Using the whole SMT concept as PR for the P4 stinks, IMHO. I would prefer to see a single NEW processor which has 32 independent processors on chip, preferably all 32-bit processors ;)

>-Tom
Last modified: Thu, 15 Apr 21 08:11:13 -0700