Computer Chess Club Archives


Subject: Re: SURPRISING RESULTS P4 Xeon dual 2.8Ghz

Author: Matt Taylor

Date: 09:50:48 12/17/02


On December 17, 2002 at 12:08:41, Vincent Diepeveen wrote:

>On December 17, 2002 at 11:50:20, Matt Taylor wrote:
>
>>On December 17, 2002 at 11:25:10, Vincent Diepeveen wrote:
>>
>>>On December 17, 2002 at 10:58:51, Bob Durrett wrote:
>>>
>>>
>>>Indeed you are correctly seeing that DIEP, which runs well on
>>>cc-NUMA machines as well, is a very good program from intels
>>>perspective, because even a 'second' processor on each physical
>>>processor which runs slower will still give it a speedboost,
>>>where others simply slow down a lot when you do such toying.
>>>
>>>So for many programs, which will be way slower when running
>>>4 processes/threads on a 2 processor Xeon, the software is the
>>>weak link.
>>>
>>>In case of DIEP the bottleneck is the hardware clearly. Even
>>>something working great on cc-NUMA doesn't profit too much from
>>>the SMT/HT junk from intel.
>>
>>Clearly? It seems to me that memory is your bottleneck, and logical CPUs
>>obviously don't help you there.
>
>for the SMT/HT the memory isn't my bottleneck at all. the fact that
>it's not 2 real processors but something that has to wait for the
>other each time is the problem.

Actually it doesn't work like that. The CPU has an existing bandwidth of 3
micro-ops/cycle. It is rare that x86 code utilizes this full bandwidth. Anyway,
HT allows the CPU to run 2 threads literally at the same time. Literally.

Thread 1 schedules 2 micro-ops but can't fit in the third because its result
depends on other things currently being computed. Thread 2 says,
"Yay, it's my birthday!" and schedules its next micro-op. When the three
micro-ops retire, the same thing happens again. If thread 1 enters a wait state
(hlt, pause, memory wait, etc.), it's not scheduling any micro-ops. Thread 2 now
has 3 micro-ops/cycle to utilize. Without HT, thread 2 executes a total of zero
micro-ops. With HT, thread 2 executes a total of more than zero micro-ops.
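
As a rough illustration of the slot sharing described above, here is a toy
Python model. It is a sketch under loud assumptions, not a description of the
real scheduler: the issue width of 3 is the figure quoted above, thread 1 is
arbitrarily given first pick of the slots, and the per-cycle "ready" counts are
made-up inputs.

```python
ISSUE_WIDTH = 3  # micro-ops per cycle, the figure quoted in the post


def issue_with_ht(ready_t1, ready_t2, width=ISSUE_WIDTH):
    """Total micro-ops issued by each thread when both share the slots.

    ready_t1 / ready_t2 list how many micro-ops each thread could issue in
    each cycle (hypothetical inputs; 0 models a stalled or waiting thread).
    """
    t1_total = 0
    t2_total = 0
    for r1, r2 in zip(ready_t1, ready_t2):
        t1 = min(r1, width)        # thread 1 takes what it can use
        t2 = min(r2, width - t1)   # thread 2 fills whatever is left over
        t1_total += t1
        t2_total += t2
    return t1_total, t2_total


def issue_without_ht(ready_t1, width=ISSUE_WIDTH):
    """Same cycles with only thread 1 on the CPU: thread 2 issues nothing."""
    return sum(min(r, width) for r in ready_t1), 0


# Thread 1 is dependency-limited for two cycles, stalls for one, then is busy.
thread1 = [2, 2, 0, 3]
thread2 = [3, 3, 3, 3]
print(issue_with_ht(thread1, thread2))
print(issue_without_ht(thread1))
```

In this model thread 1 issues the same number of micro-ops either way, and
everything thread 2 issues comes from slots that would otherwise sit idle.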

Now, I am no parallel researcher, but even my parallel code doesn't suffer
overheads so large that it can't gain from HT.

>>>Though it is a great sales argument, the hard facts (11.4%
>>>speedboost) are not lying.
>>
>>11.4% doesn't lie for chess, or at least for Diep. Intel didn't advertise, "Wow!
>>HT will make your chess programs run faster!" Intel said HT will get a 30-40%
>>speed gain across applications on -average-.
>
>That is a typical marketing thing. they compare HT versus HT. So
>2 processes HT versus 4 processes HT instead of  2 processes NON HT versus
>4 processes HT.
>
>If you look to diep's speeds you'll see that
>  181538 4 processes HT is a lot faster than 2 processes HT: 135924 nps.
>
>That's 33.6% speedup.
>
>However it is not a fair compare. The fair compare shows a 11.6% speedup.
>
>What was posted from crafty here was the unfair compare. No fair compare
>was posted so far.
>
>Who is testing objectively here?

You never said what "2 processes" was. Is it one physical CPU with HT or two
physical CPUs without HT?

Whether or not it's objective, nobody is going to listen if you don't do a good
job of clearly organizing and reporting your data. You didn't list clock speeds,
bus speeds, memory types, chip types, configurations (HT vs. non-HT), or any of
the other important information which is needed for anyone aside from you to
make any sort of decision based on that data.
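
For reference, the arithmetic behind the quoted percentages can be checked in a
few lines of Python. The two nps figures are the ones quoted above. The baseline
for the "fair" comparison was never posted, so the second function only
back-calculates what that baseline would have to be; it is an inference from the
11.6% figure, not a measurement.

```python
def speedup_percent(new_nps, old_nps):
    """Percentage gain of new_nps over old_nps."""
    return (new_nps / old_nps - 1.0) * 100.0


def implied_baseline(new_nps, gain_percent):
    """Baseline nps that would make new_nps a gain of gain_percent."""
    return new_nps / (1.0 + gain_percent / 100.0)


ht4 = 181538  # 4 processes with HT, from the quoted post
ht2 = 135924  # 2 processes with HT, from the quoted post

print(f"HT vs HT: {speedup_percent(ht4, ht2):.1f}%")
print(f"baseline implied by an 11.6% gain: {implied_baseline(ht4, 11.6):.0f} nps")
```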

>>>So they need to press 2 cpu's which results in a cpu price
>>>2 times higher *at least* than an AMD cpu, the result
>>>is that you win 11.4% in speed.
>>
>>Intel has always charged astronomical prices for their latest CPUs. HT isn't
>>driving the price up. Intel doesn't like losing profits.
>
>>In 6 months, the Pentium 4 3.06 GHz will be in the $200-$300 range just like the
>>Pentium 4 2.53 GHz is now. A year from now, it will cost $100-$200. Five years
>>from now, it will be on keychains.
>
>>>Though i am not a hardware engineer, i can imagine the problems
>>>they had getting this to work.
>
>>Yes, they had to build a mux and duplicate some components. The infrastructure
>>has been there for the past 5 years.
>
>>>Instead of a P4-Xeon cpu clocked at 2.8Ghz which can split itself
>>>into 2 logical processors, i would have preferred a P3-Xeon cpu
>>>which split itself into 2 real processors (so each having its
>>>own L1 and L2 caches) clocked at 2.0Ghz.
>>
>>They had trouble clocking the Pentium 3 above 1 GHz. It's been run at
>>frequencies from 150 MHz (the slowest Pentium Pro that I recall ever seeing, but
>>perhaps not the slowest) all the way up to 1.4 GHz. A design only scales so far.
>>Wouldn't it be nice if you could buy 3 GHz Athlons? Athlon just won't run at 3
>>GHz. Pentium 4 does because it's designed to. Pentium 3 wasn't even designed to
>>hit 1.4 GHz; it wouldn't go much further anyway.
>
>Athlon only recently is converted to 0.13

Yes, last February or so.

>the reason why the P4 clocks so high is because they use such a small
>L1 cache and a small trace cache (though compared to the data cache it's
>huge).

No. By that logic, the 386 which has -no- cache should clock higher than all of
them.

The P4 clocks high because it has a deep pipeline. Circuits have latency. It is
small, but it is there. You can only run a circuit so fast because the signals
need time to propagate from one end of the circuit to the other. There are two
ways to shorten that path: shrinking the process, which shrinks the traces so
the signals get there faster, and lengthening the pipeline, which leaves less
logic between one stage and the next. Either way, the end result is a CPU that
can clock higher. Doing less work per cycle seems counter-intuitive, but they
get around that by having more stages that each do less work.
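
A back-of-the-envelope way to see this is the usual textbook model: the cycle
time is the logic delay of one stage plus a fixed per-stage overhead. The Python
sketch below uses that model with hypothetical delay numbers chosen only for
illustration; they are not measurements of any real CPU.

```python
def max_clock_ghz(total_logic_ns, stages, latch_overhead_ns):
    """Upper bound on clock rate for logic split evenly across stages.

    total_logic_ns:    end-to-end delay of all the logic (hypothetical)
    stages:            number of pipeline stages the logic is divided into
    latch_overhead_ns: fixed per-stage cost of the pipeline registers
    """
    cycle_ns = total_logic_ns / stages + latch_overhead_ns
    return 1.0 / cycle_ns


# Same logic, deeper pipeline: each stage has less to do, so the clock rises.
print(max_clock_ghz(10.0, 10, 0.1))
print(max_clock_ghz(10.0, 20, 0.1))

# Process shrink modelled as less total delay at the same pipeline depth.
print(max_clock_ghz(7.0, 10, 0.1))
```

The fixed overhead is why doubling the depth gives less than double the clock,
which is one reason a design only scales so far.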

>What i dislike a lot is the huge branch misprediction penalty. I'm not
>lying when i claim that diep could be sped up 2 times on the P4 if the
>P4 did not have such a very bad branch misprediction penalty.

Branch mispredictions almost never occur in well-written code. In poorly-written
code, they're easy to get. The Intel C compiler estimates the probability that a
branch will be taken and schedules it so the CPU will guess correctly as often
as possible.

Branch mispredicts affect the Athlon, Pentium 3, K6, and Pentium processors as
well. Believe it or not, an original Pentium can branch mispredict. It hurts
with deeper pipelines because you have to flush the entire pipeline and do all
that work over again. This is why Intel and AMD have both put extensive effort
into making adaptive branch prediction counters. If there is any cycle to be
found in a branch, their algorithms will find it. I looked at the algorithms
in-depth a year ago, and I was quite impressed with the one on the P6 core.
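
To show what "finding a cycle" means, here is a small Python sketch of a generic
two-level adaptive predictor: the last few outcomes of a branch select one of
several 2-bit saturating counters. This is an assumed textbook scheme used for
illustration, not the actual P6 or Athlon algorithm, and the history length and
initial counter value are arbitrary choices.

```python
def simulate_predictor(outcomes, history_bits=4):
    """Return a list of booleans: was each branch outcome predicted correctly?

    outcomes:     sequence of booleans, True meaning the branch was taken
    history_bits: how many past outcomes select the counter (0 = one counter)
    """
    table = [1] * (1 << history_bits)  # 2-bit counters, start weakly not-taken
    mask = (1 << history_bits) - 1
    history = 0
    correct = []
    for taken in outcomes:
        counter = table[history]
        predicted_taken = counter >= 2
        correct.append(predicted_taken == taken)
        # Saturating update: move toward 3 when taken, toward 0 when not.
        if taken:
            table[history] = min(counter + 1, 3)
        else:
            table[history] = max(counter - 1, 0)
        # Shift the newest outcome into the history register.
        history = ((history << 1) | int(taken)) & mask
    return correct


# A branch that repeats taken, taken, not-taken.
pattern = [True, True, False] * 50

with_history = simulate_predictor(pattern, history_bits=4)
single_counter = simulate_predictor(pattern, history_bits=0)
print(sum(with_history[-30:]), "of the last 30 correct with history")
print(sum(single_counter[-30:]), "of the last 30 correct with one counter")
```

Once the history has warmed up, every history value is always followed by the
same outcome, so the repeating pattern stops costing mispredicts. A lone
counter keeps missing the not-taken branch in each cycle.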

>also 1 decoder for new instructions i do not understand at all.

Because the trace cache caches the decoded output. I don't understand how they
get any performance out of that, but they obviously do somehow.

>Basically the P4 is a cpu where inefficient coding is getting rewarded.
>
>If you code very bad and need a lot of extra variables and instructions
>to get something done then the number of branches get kept relatively
>lower than a very efficient program which is doing a few instructions
>but can't prevent a branch there because other code needs execution.

...what? I've been doing x86 optimization for several years now, and I've never
come across anything that claims the number of instructions means anything
besides code size.

It is the -clever- code that avoids branches. Branches hurt on any x86 CPU, not
just Pentium 4. It takes heavily tweaked code to make the Pentium 4 run
efficiently. I don't see how inefficient coding is getting rewarded by the
Pentium 4 since the Athlon blows away the Pentium 4 in unoptimized code.

>Replacing branches by extra instructions is simply not possible anymore,
>because already when the pentiumpro came out, i already started slowly
>avoiding branches whenever i could. I had that thing around end of 1996
>if memory serves me well.

?

I was writing MMX code just yesterday to simulate long arithmetic including a
64-bit x 64-bit multiply. MMX does not support conditional branching. You have
to play cute little games such as, "When x = 0, y = -1, so I'm going to subtract
y from x."

I've seen non-branching implementations for all sorts of basic functions such as
popcount, min, max, and abs. In fact, there was a lengthy thread on this forum a
couple weeks ago about optimal popcount/bitscan functions. The branching one
came in dead last.
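
The usual non-branching versions of these functions look roughly like the
following. This is a sketch of the well-known bit tricks written in Python only
to show the arithmetic; it assumes signed 32-bit inputs (unsigned for popcount),
the function names are mine, and Python itself gives no guarantee about what
branches the interpreter executes. It is not the code from the thread mentioned
above.

```python
MASK32 = 0xFFFFFFFF


def abs32(x):
    """abs() of a signed 32-bit value using only shifts, xor and subtract."""
    sign = x >> 31             # -1 if x is negative, 0 otherwise
    return (x ^ sign) - sign   # negative: ~x + 1 == -x; otherwise unchanged


def min32(x, y):
    """min() of two signed 32-bit values without a comparison."""
    diff = x - y
    sign = diff >> 32          # -1 if x < y, 0 otherwise (diff fits in 33 bits)
    return y + (diff & sign)   # y + (x - y) when x < y, else y


def max32(x, y):
    """max() of two signed 32-bit values without a comparison."""
    diff = x - y
    sign = diff >> 32
    return x - (diff & sign)   # x - (x - y) when x < y, else x


def popcount32(v):
    """Count set bits in a 32-bit value by summing adjacent bit fields."""
    v &= MASK32
    v = v - ((v >> 1) & 0x55555555)                  # 2-bit sums
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333)   # 4-bit sums
    v = (v + (v >> 4)) & 0x0F0F0F0F                  # 8-bit sums
    return ((v * 0x01010101) & MASK32) >> 24         # add the four bytes


print(abs32(-5), min32(3, 7), max32(3, 7), popcount32(0xFF))
```

The sign mask plays the same role as the "y = -1" game described above: it is
either all ones or zero, so it selects a value arithmetically instead of
jumping.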

>>>That would have kicked anything of course from speed viewpoint as
>>>it scales 1 : 1.2 to a K7 (k7 20% faster for each Ghz than the P3).
>>>
>>>Now we end up with a very expensive cpu which is 1 : 1.4 and a bad
>>>working form of HT/SMT.
>>>
>>>So it's not DIEP having a problem here. But the hardware very clearly.
>>>Intel optimistically claims 20% speed boost here and there. Others
>>>claim 11% for database applications.
>>>
>>>I see 11.4% for DIEP. So that's a market conform viewpoint.
>>>
>>>The not so amazing thing of this all is that a 2.8Ghz Xeon being not
>>>deliverable yet here is very expensive (even a 3.06Ghz P4 is already 885
>>>euro in the shops here also not yet deliverable) and the MP2200 which
>>>DOES get offered for sales here is 290 euro. the fastest Xeon i see
>>>getting offered socket 603 is a 2.0Ghz Xeon for 829 euro at alternate.nl
>>>
>>>a dual motherboard for the P4 i see here is several:
>>>  789 euro for a dual xeon motherboard called: 860d pro (msi)
>>>  549 euro for a tyan S2720GN is by far the cheapest i see
>>>
>>>then you gotta buy ecc registered DDR ram for it.
>>>
>>>a dual motherboard for K7 i see at the same alternate.nl is:
>>>  259 euro for A7M266-D/U
>>>  299 euro chaintech 7KDD (dual; U-DMA/133 RAID en sound)    AMD-762MPX
>>>  289 euro tiger MPX S2466N-4M
>>>
>>>The last mainboard (tiger) for sure needs registered DDR ram. but lucky
>>>not ECC ram.
>>
>>AMD is always cheaper than Intel for the same level of performance.
>
>if you look how huge that P4 chip is compared to the AMD chip it is not
>a miracle either.
>
>knowing AMD has just 1 0.13 factory versus intel a lot it is not a miracle
>either that in the future this will remain the same.

The P4 is more expensive to produce because its die is larger, so Intel gets
fewer chips per wafer and lower yields. However, it does not cost them $700 per
chip.

>>Also, I own a TigerMPX S2466N-2M (only difference being that they don't mind
>>telling me to eat a PCI slot for USB). At one point I only had 1 256 MB
>>unregistered/non-ECC DIMM because my other 512 MB unregistered/non-ECC DIMM had
>>failed. I finally replaced both with a single 1 GB Registered/ECC DIMM.
>>
>>If anyone wants to send me a digital camera, I'll take pretty pictures of the
>>BIOS screens, my unregistered DIMM, and a working TigerMPX system on
>>unregistered ram.
>
>not all unregistered DIMMS do not work for a system requiring registered
>dimms. I can give you the names of 3 persons with problems with a Tiger
>(not sure they had MPX chipset though but the older tiger MP760 chipset
>i guess) who after a few days had severe stability problems with it and
>weird crashes each week or so.

Weird crashes = user problem or bad memory. I've had 2 out of my 4 DDR SDRAM
DIMMs go bad. I've yet to see bad RDRAM (I have also seen very little of it),
and I had -1- bad SDRAM DIMM once.

The TigerMPX manual explicitly states that it's possible to use up to 2
unregistered DIMMs. To exceed 2 DIMMs or 512 MB/DIMM, you need Registered/ECC
DIMMs. The ECC is not necessary, but most ECC memory is Registered and vice
versa.

I don't think the TigerMP required anything more than the MPX did, either. The
main differences include older 64-bit 33 MHz slots instead of 64-bit 66 MHz
slots and the 760 MP chipset instead of the 760 MPX chipset. The chipset
difference means the 760 MPX allows up to 4 GB of memory, and the 760 MP allows
up to 3 GB of memory.

I can't verify right now because I can't access Tyan's website. It's very slow
for me.

>>If I'm feeling generous, I'll also take pictures of my dual-AthlonMP 2000 system
>>at work.
>
>>>the P4 dual motherboards need for sure ecc registered stuff.
>>>
>>>The only good news is that ddr ram ecc registered is a lightyear cheaper
>>>than ecc registered RDRAM.
>>>
>>>RDRAM RIMM 256 MB (ValueRAM, ECC)    voor PC   PC1066   EUR 239,00
>>>now you can't need 256MB at all. You need more RAM than that. which is
>>>exponential more expensive i fear.
>>>
>>>You get better served with DDR ram though:
>>>  kingston 1GB DIMM 1 GB (Registered) for PC   PC266   EUR 599,00
>>>
>>>It is amazing how many professors and others still throw away money
>>>to get that dual 2.8Ghz P4 which is over 2 times more expensive than
>>>AMD dual at the moment is.
>>
>>Money grows on trees for some people. It is amazing how my coworkers convinced
>>management to purchase machines with Radeon 9700 Pro graphics cards for "work."
>>These cards were 20% of the cost of the whole machine at around $350 USD per
>>card.
>
>right ;)

I double-checked my estimation and $350 / $2000 = 17.5%, so I was pretty close.
Again, I'd take pretty pictures, but I don't have a digital camera.

I would also offer an explanation for the ATI Radeon 9700 Pro in my work
machine, but I can't fathom the logic myself. "Let's put the best video card
money can buy in their machines and ask them not to use it to play games."
Right.

>>Still, it is against social etiquette to tell people how to spend their money.
>>If someone wants to throw away money, they're fully entitled to do so.
>>-Matt
>
>Obviously, but i want to get away the fairy tale that more expensive machines
>are always better.
>
>Of course there is a supercomputer league where price doesn't matter.
>
>where prices get measured in millions rather than thousands.
>
>In that category we don't talk about 11.4% speedups of course.
>
>But we talk about a 500 processor DIEP then at 500 real processors :)
>
>Yet we must be realistic and see that there's just 1 such a great supercomputer
>in whole netherlands with 1024 processors (www.sara.nl and click on the
>'teras'; owned by NWO: www.nwo.nl).
>
>then i realize again why i put in months of effort to rewrite diep to
>cc-NUMA (still busy improving it!) and why i won't spend time to manuals
>describing what SMT/HT is actually doing in hardware and what instructions
>can get parallellized and which instructions/actions cannot.
>
>Best regards,
>Vincent

Of course. Multiprocessing will scale performance beyond what a single processor
can do, and NUMA architectures will take it further.

-Matt



Last modified: Thu, 15 Apr 21 08:11:13 -0700
