Computer Chess Club Archives



Subject: Re: Hammer info. And some SMP musings.

Author: Vincent Diepeveen

Date: 07:37:02 03/24/02



On March 23, 2002 at 16:09:55, Tom Kerrigan wrote:

>On March 23, 2002 at 13:38:45, Dan Andersson wrote:
>
>>SMT is of little or no consequence to chess programs. It might even slow it
>>down. You don't think it automatically doubles the amount of functional units on
>>a given CPU, do you?
>
>You're completely missing the point. SMT was invented and implemented because
>most of a chip's functional units are idle at any given point in time--using
>them for another thread gives you free performance.
>
>I haven't seen any benchmarks yet, but a quad P4 Xeon will appear to software as
>an 8-way system and while it will probably not be as fast as a full-on 8-way
>system, it will be much faster than a 4-thread system.

Complete nonsense. A single P4 can't even outgun a single-CPU MP K7,
not to mention a dual. A dual P4 Xeon 1.7GHz is 20% slower than a
dual MP K7 1.2GHz for DIEP. With 3 instructions a clock, a 12K micro-op
trace cache and 1024 words of L1 data cache, it is of course insane to
run another process on a P4 processor at the same time.

SMT as a whole is interesting for the future, but complete nonsense for
the P4. It is just marketing hype. For sure not a single P4 will ever
profit from it.

Note that 'thread' gets confused with 'process' here too. Processes can
execute something directly; threads are all *forced* to do things
indirectly, using extra registers for indirection, so for me there is
already a clear speed difference between the two. Let's skip that
difference, however; it is not important for the SMT discussion.

Another thing: how do you *efficiently* use the idle time that
is left over in a processor?

Let's skip the OS problem for now, and focus on how to get a speedup
out of it with a chess program.

No one so far seems to have thought about that, simply because the
majority of chess programmers are not searching in parallel;
otherwise they would already have realized that it is impossible
to give a search thread some execution time only 'now and then'.

Suppose that on a 2-processor Xeon system (forget 4-processor P4 boxes
for now) I run 4 search processes. Let's assume that I assign
processes A1 and A2 to processor A and that processes B1 and B2 are
running on processor B.
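
As a rough sketch of that setup (not taken from DIEP): the mapping below
assumes, hypothetically, that the two logical CPUs of each physical
processor are numbered next to each other, which real systems do not
guarantee. The names affinity_plan and pin_current_process are made up for
illustration, and os.sched_setaffinity only exists on some platforms
(Linux, for one), so the pinning helper checks for it first.

```python
import os

def affinity_plan(physical_cpus=2, threads_per_cpu=2):
    """Map process names such as 'A1' to a hypothetical logical CPU id."""
    plan = {}
    for cpu in range(physical_cpus):
        letter = chr(ord("A") + cpu)          # processor A, B, ...
        for slot in range(threads_per_cpu):
            # Assumption: sibling logical CPUs have adjacent ids.
            plan[f"{letter}{slot + 1}"] = cpu * threads_per_cpu + slot
    return plan

def pin_current_process(logical_cpu):
    """Pin the calling process to one logical CPU where the OS supports it."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {logical_cpu})   # 0 means "this process"
        return True
    return False

plan = affinity_plan()
print(plan)
```

Each search process would call pin_current_process(plan[name]) once at
startup; nothing is pinned by the snippet itself.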

Whatever the 'idle' time on processor A, process A1 is simply executing
way, way faster than A2. Also, A2 is completely thrashing the 1024-word
L1 data cache and the small 12K micro-op trace cache.

So we start directly at a loss.

Process A1 is sometimes LOCKING and so is process A2.
This is a crucial thing non-SMP programmers always tend to forget.

Also, I have no idea what takes the most processor time for my
program, but I assume it is the decoding of integer instructions and
branches and, most importantly, the 3 instructions a clock limit.

It is of course impossible for process A2 to 'jump in' here and do
crucial integer decoding and execution while A1 is doing that. If A1
keeps those units busy all the time, A2 can hardly hop in.

A2 can only do that when A1 is idling here. Realistically speaking,
A1 is *only* busy doing this. So A2 will get very little execution time.
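
A toy model of that argument, under assumptions that are mine and not
measured on any real P4: A1 always gets first pick of the issue slots, A2
only gets what is left over in the same clock, and demand that is not met
in a clock is simply dropped. The function name and the workload numbers
are hypothetical.

```python
def share_issue_slots(width, a1_demand, a2_demand):
    """Return (a1_issued, a2_issued) when A1 always gets first pick.

    width     -- instructions the core can issue per clock
    a1_demand -- per-clock list of instructions A1 could issue
    a2_demand -- per-clock list of instructions A2 could issue
    """
    a1_issued = a2_issued = 0
    for want1, want2 in zip(a1_demand, a2_demand):
        got1 = min(want1, width)
        got2 = min(want2, width - got1)   # A2 only sees the leftover slots
        a1_issued += got1
        a2_issued += got2
    return a1_issued, a2_issued

# Hypothetical workload: A1 fills all 3 slots except one stalled clock in 100.
a1 = [3] * 99 + [0]
a2 = [3] * 100
print(share_issue_slots(3, a1, a2))   # (297, 3)
```

With these made-up numbers A2 issues 3 instructions while A1 issues 297.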

The result is that you end up with a process A2 that is 100 times
slower than A1. In short, if A2 is holding a lock, then A1 is
continuously busy with integer XCHG instructions (so not idling) and
does no useful work for a certain period of time.
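
A minimal sketch of that kind of busy waiting. Python has no XCHG, so a
non-blocking try-acquire on threading.Lock stands in for the test-and-set
loop here; that substitution, the function name spin_demo and the
iteration counts are all assumptions for illustration, not how any chess
program does it.

```python
import threading

def spin_demo(workers=2, increments=200):
    """Count how often each worker retries before it gets the lock."""
    lock = threading.Lock()          # stands in for the shared lock word
    counter = [0]                    # shared data protected by the lock
    wasted = [0] * workers           # per-worker count of failed attempts

    def worker(idx):
        for _ in range(increments):
            # Spin: keep retrying instead of sleeping, like a test-and-set loop.
            while not lock.acquire(blocking=False):
                wasted[idx] += 1     # busy, but no search work gets done
            counter[0] += 1          # critical section
            lock.release()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0], wasted

total, wasted = spin_demo()
print(total, wasted)
```

The wasted counts vary from run to run; every one of those retries is
processor time that looks busy and produces nothing.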

An important problem now is that if the processor sees both processes
as 'identical', then now and then A1 and A2 will switch roles. Let's
assume that A1 and A2 switch roles. Then the total speed of both
processors slows down *significantly*.

In short, SMT only makes sense if a processor is potentially so fast
that the only thing preventing it from executing a process faster is
the sequential dependencies in the execution path (branches and such).

A first condition is obviously that a processor can do so many
instructions a clock, of which more than 50% go unused, that a
second process can hop in and get near the speed of the other process.
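
That condition can be written down under a deliberately idealized
assumption (mine, not from any measurement): A1 has priority and uses a
fraction u of the W issue slots per clock, A2 is an identical workload,
and caches and locking are ignored.

```latex
\[
\frac{\mathrm{speed}(A2)}{\mathrm{speed}(A1)}
  = \min\!\left(1,\ \frac{(1-u)\,W}{u\,W}\right)
  = \min\!\left(1,\ \frac{1-u}{u}\right),
\qquad
\frac{1-u}{u} \ge 1 \iff u \le \tfrac{1}{2}.
\]
```

So in this simplified picture A2 only keeps up when A1 leaves at least half
of the slots unused; at u = 0.9, A2 runs at about 0.11 of A1's speed.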

That isn't the case with the P4 at all. Going from 3 to 4
instructions a clock would still not solve the problem.

The P4 has a hardware limit of 3 instructions a clock. The K7 too, I guess.

If we get processors which can do 6 or more instructions a clock and
have huge on-chip caches (megabytes) which are very fast, then
we can think about SMT.

IMHO the whole SMT concept stinks when used as PR for the P4.
I would prefer to see a single NEW processor
which has 32 independent processors on chip,
preferably all 32-bit processors ;)

>-Tom




Last modified: Thu, 15 Apr 21 08:11:13 -0700
