Computer Chess Club Archives


Subject: Re: other stuff answerred

Author: Matt Taylor

Date: 09:48:48 12/18/02


On December 17, 2002 at 21:47:22, Vincent Diepeveen wrote:

>On December 17, 2002 at 13:14:56, Matt Taylor wrote:
>
>>On December 17, 2002 at 12:30:40, Vincent Diepeveen wrote:
>>
>>>On December 17, 2002 at 11:59:48, Matt Taylor wrote:
>>>
>>>>On December 17, 2002 at 11:33:36, Vincent Diepeveen wrote:
>>>>
>>>>>On December 17, 2002 at 11:27:18, Matt Taylor wrote:
>>>>>
>>>>>>On December 17, 2002 at 10:10:46, Vincent Diepeveen wrote:
>>>>>>
>>>>>>>Hello,
>>>>>>>
>>>>>>>Some tests were performed in the USA, where some P4 Xeon dual 2.8Ghz
>>>>>>>systems get delivered now. In Europe we can't get them yet and
>>>>>>>most likely we don't want them either:
>>>>>>>
>>>>>>>Here are the results of DIEP at the Xeon 2.8Ghz dual ECC registered DDR ram.
>>>>>>>
>>>>>>>test 1: diep 4 processes. Of course HT enabled.
>>>>>>>   181538 nps
>>>>>>>
>>>>>>>test 2: diep 2 processes. HT enabled.
>>>>>>>   135924 nps
>>>>>>>
>>>>>>>test 3: diep 2 processes K7 1.6ghz (registered DDR ram all other settings
>>>>>>>        identical to xeon dual setup):
>>>>>>>   146555
>>>>>>>
>>>>>>>THE 2 TESTS NALIMOV DIDN'T OR COULDN'T WANT TO DO WITH CRAFTY
>>>>>>>SOME WEEKS AGO REVEAL A BIG WEAKNESS OF HT/SMT:
>>>>>>>
>>>>>>>test 4: diep 2 processes. HT disabled.    171288 nps
>>>>>>>
>>>>>>>test 5 and 6: diep single cpu HT disabled and enabled were same speed
>>>>>>>   92090  nps versus 92019 nps.
>>>>>>
>>>>>>Crafty gets better results with HT, but it's been optimized for HT. It just
>>>>>
>>>>>That hasn't been proven yet.
>>>>>
>>>>>there was no test done without HT and 2 processors as far as i know.
>>>>>
>>>>>Please read how i tested it.
>>>>
>>>>I'm pretty sure he did non-HT tests too.
>>>>
>>>>>>means you need a personal Intel engineer to make it blazing fast for people who
>>>>>>plopped down $600 USD for a top-of-the-line Intel chip. Before long they'll
>>>>>>start selling Intel engineers in local computer shops. Collect all 18...
>>>>>
>>>>>Crafty is doing 2 probes in 2 hashtables for example. Remove it and
>>>>>improve it to 4 probes at 1 table (which is faster on both intel and
>>>>>AMD anyway, but AMD profits more because its chipset is cheaper).
>>>>>
>>>>>>HT is a good idea, and it works in practice rather than just on paper. It just
>>>>>>doesn't work for -everything-.
>>>>>
>>>>>in the factory they press 2 cpu's and put a single P4 sticker on it.
>>>>>You pay a factor 2 more, but get something 11.4% faster. For databases
>>>>>it was measured 11% rather than 11.4%.
>>>>>
>>>>>That's what i call a bad buy!
>>>>
>>>>CPUs since the Pentium have been pipelined. The goal is to spread the work out
>>>>so you can get a throughput of at least 1 op/cycle. Not always possible,
>>>
>>>1 op/cycle is very bad.
>>>
>>>Already when the pentiumpro 200 existed the average C program ran at
>>>1.76 instructions a clock.
>>>
>>>So 1 iop/cycle is very bad then.
>>>
>>>Considering that a K7 when calculated back to the pentiumpro200 is hell
>>>of a lot faster for each processor clock, then you will realize clearly
>>>that it's not worse than 1.76 now.
>>>
>>>Note that the 1.76 number wasn't measured by me. I have no idea how i
>>>measure this for DIEP otherwise i would know.
>>
>>The Pentium Pro executed an average of 1.76 micro-ops/cycle. There is a big
>>difference between an average speed and an instantaneous speed. If I travel 10
>>km/h for the duration of 1 hour and then remain still for 9 hours, I had an
>>average speed of 1 km/h. I think the difference here is obvious enough that it
>>warrants no further discussion.
>
>>>I see however that processors like McKinley which can do more instructions
>>>a clock (6 a bundle) that it achieves way faster speeds for DIEP than
>>>any other processor. No, being 64 bits has nothing to do with that.
>
>>Itanium is designed for high IPC. They don't have to ramp the clock speed to
>>make it fast. It is silly to criticize the Pentium 4 for low IPC when it is
>>-designed- for it. It's like criticizing a Geo Metro for being slow. Most people
>>will scratch their heads for a second, look at you funny, and then say, "Duh?"
>
>No one will agree with you here. For the speed they can reach with it
>it is very crucial what IPC you get. Simple as that. If you get to
>2.8 Ghz and get 0.00000000000001 IPC then you can't say "that's the goal".
>
>You want to be competitive so you want to execute stuff faster.
>
>If you execute a program faster then that's a combination of how much
>Ghz of horse power you throw into it and the IPC you get.

Think about it. You just gave the equation for raw performance: IPS = IPC *
clockrate. If you want to maximize IPS (raw performance), you need a winning
combination of IPC and clockrate.

A lot of people are hung up on IPC. If I produce a processor that sustains 4 IPC
on average but can only run 100 MHz, how am I going to beat a 500 MHz processor
that averages 1 IPC? The processor I built is going to be 400 MIPS, and the
competing processor is 500 MIPS even though it has a lower IPC.
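
To make the arithmetic concrete, here is a minimal sketch of the IPS = IPC * clockrate relation using the two hypothetical processors from the paragraph above. The function name `mips` is made up for illustration; the numbers are the ones in the text, not measurements.

```python
def mips(ipc: float, clock_mhz: float) -> float:
    """Millions of instructions per second: average IPC times clock rate in MHz."""
    return ipc * clock_mhz


# The two hypothetical processors from the text.
wide_but_slow = mips(ipc=4.0, clock_mhz=100.0)    # 4 IPC at 100 MHz
narrow_but_fast = mips(ipc=1.0, clock_mhz=500.0)  # 1 IPC at 500 MHz

print(wide_but_slow, narrow_but_fast)  # 400.0 500.0
```

The lower-IPC part still comes out ahead because the clock ratio (5x) outweighs the IPC ratio (4x).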

It is certainly dubious that a processor with atrocious IPC will win in raw
performance, but it is possible. In this case, a lot of claims are thrown around
about P4. Most simply aren't true. I was told a long time ago that the P4 takes
4-6 cycles to shift. It does, but someone forgot to mention that the 4-6 cycle
figure is latency. It has a throughput of 1 shift/cycle.
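
The latency/throughput distinction can be sketched with a simple cycle count. This is an idealized model, not a P4 simulator: it assumes a fully pipelined unit, takes a 4-cycle latency (the low end of the 4-6 figure above), and the helper names are invented for the example.

```python
def cycles_dependent(n_ops: int, latency: int) -> int:
    """Each op needs the previous result, so the full latency is paid every time."""
    return n_ops * latency


def cycles_independent(n_ops: int, latency: int, issue_interval: int = 1) -> int:
    """Independent ops overlap in the pipeline: the latency is paid once,
    then one more result completes every issue interval."""
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1) * issue_interval


# 100 shifts with a 4-cycle latency and a throughput of 1 shift/cycle.
print(cycles_dependent(100, latency=4))    # 400
print(cycles_independent(100, latency=4))  # 103
```

Under these assumptions the latency only dominates when every shift depends on the one before it; a stream of independent shifts approaches one per cycle.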

If you look back on the history of Intel processors, they have focused for so
long on IPC, which is probably one reason why everyone scratches their head when
thinking about the P4 design. Timings from the 8086 through Pentium get
consistently better. The Pentium could execute 2 IPC in a lot of cases. The
PPro/P2/P3 could execute up to 3 IPC. The P4 can execute up to 3 IPC, but it is
much harder to fully utilize that limit on P4. At twice the clockrate of the
highest clocked P3, I don't see how this is an issue. They're going higher, too.
This is Intel's strategy.

>>In the case of Pentium 4, it seems paradoxical to achieve greater speeds with
>>lower IPC, but it is the strategy Intel picked. It is useless to criticize
>>further without actual facts (such as the fact that Pentium 4 is intended to hit
>>5 GHz).
>
>Fact: cheap K7 beats P4 for DIEP.
>
>Proven over and again. Even the latest release of the Xeon at 2.8Ghz
>is getting kicked by a 2Ghz MP in case of DIEP. That's *real bad* for
>the P4 if you consider its a x86 processor.
>
>Note that shouting 5Ghz for a 0.13 micron design is pretty weird.
>How to ever get above 3.5Ghz with the 0.13micron P4 with air cooling?
>
>I have to see that first before i believe it!
>
>with 3.06ghz the P4 is very close to its end. The K7 will be clocked
>till 3Ghz of course too. AMD just needs a bit more time to get new
>technology to work it seems than intel needs.

Of course Diep wins on a K7. I'm not even going to count how many times I have
said, "P4 loses until Intel engineers optimize code for it." AMD's processors
since the K6 have been FASTER than Intel processors on
unoptimized/poorly-optimized code. That's almost a moot point considering Intel
sends engineers to Microsoft and many other places.

Also, nobody said 5 GHz at 0.13 microns. Intel said 5 GHz. They will hit 5 GHz.
It may require another shrink, but they've already got it planned.

No, the K7 will not be clocked to 3 GHz. The K7 has one more iteration to go.
Check AMD's roadmap:
http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_608,00.html

>>>In contradiction to move a 64 bits word at the McKinley requires more
>>>power than moving a 32 bits word at a K7.
>>>
>>>Diep is a 32 bits program so if i compile 'int' then if the compiler makes
>>>that a 64 bits instruction for the mckinley that's not my worry.
>>>
>>>My happy feeling is the speed of it and it is doing very well. it is 33%
>>>faster (a bad compiled executable with a cross compiler) for each Ghz of
>>>a K7. 1.33Ghz K7 == 1.0 Ghz McKinley.
>>
>>That's not very good. The 1 GHz Itanium is the upper-end Itanium right now. Your
>
>That's *world conquering* in fact. No other cpu gets even *close* to
>the k7. Here is my list of performance for DIEP each Ghz,
>starting with the worst:
>
>   Alpha 21164
>   Itanium1
>   P4
>   K6
>   P2
>   UltrasparcII
>   P6 pentiumpro
>   P3 coppermine (20% faster than P2)
>   IBM Power (latest one, sorry i always forget whether it's power3 or 4 or 5)
>   Alpha 21264
>   Sun Ultrasparc IIIcu
>   AMD K7 (above 1.2Ghz; not the old 1ghz K7s which performed worse)
>   R14000 (of course it is clocked 500Mhz which in 1999 was great)
>   McKinley/Itanium2
>
>Do not forget in your mind the HUGE jump McKinley made after the
>horrible introduction of the itanium1. A McKinley is exactly THREE
>times faster than an itanium1 (at the same speed measured of course
>800Mhz versus 800Mhz).

And again, you're talking about IPC. IPC is not a direct metric for raw
performance.

It's strange that Itanium 2 is only just above K7 for you. The IA-64 breaks most
bottlenecks related to branching and whatnot. It ought to be nearly double the
K7. Crafty kicks on Itanium 2.

>McKinley 1Ghz costs about 7000$ a piece. Very cheap for the performance
>a single 1Ghz processor delivers. Don't hesitate to call SGI if you want
>to order a supercomputer with McKinleys inside. They can help you further.
>
>It kicks any other cc-NUMA system!
>
>Just look at the bandwidth these systems deliver, you'll faint!
>
>The current TERAS already delivers 1 terabyte a second what do i need
>to say more?
>
>Also all those supercomputer cpu's are clocked around 1Ghz. Only IBMs
>inferior thing is clocked to 1.3ghz but that 1.3Ghz doesn't make up
>for it much.
>
>Do not forget that the actual speed of the mckinley which i posted
>here some time ago is based upon an old test with a cross compiler
>which was very old and meant to run only for itanium1.
>
>When i compile native itanium2 and do some branch optimizations
>by using profile info, then it is a lot faster i bet!

So you told me that Itanium 2 is only 25% faster than K7 and you've never
compiled with the Itanium 2 compiler?

...

>Now if i may chose between
>
>  500Ghz McKinley or 500Ghz Ultra (note there is no SUN machines with
>  that many processors which you can call with some luck cc-NUMA) or 32
>  processor IBM Power (a single calculation unit out of 32 1.3Ghz
>  IBM Power processors is of course inferior to that 500Ghz)
>  Just compare. What is going to beat 500Ghz McKinley for DIEP?

Where can I buy the 500 GHz Itanium 2? Or is Eugene the only one with one on his
desk?

>  *nothing*.
>
>  Only about the SUN processors i can say that they are underrated in
>  the computerchess world. Bob is always saying they suck, but they do not.
>
>  They are fine processors. But not *close* to McKinley performance!
>
>  IBM is just cheating too much at the specint tests, that McKinley is
>  really kicking butt there.
>
>  *nothing* kicks it. really nothing.
>
>  Even for those who want to do vector processing it is still performing
>  pretty ok with 6 instructions a bundle. Cray doing like 29 or whatever
>  but that Cray is very expensive for each cpuhour and you can get like
>  10 McKinleys for each cray processor or so *easily* within a single
>  partition.
>
>Then just $7000 a piece. This is a buy!!

I'll look you up when I have $7000 to blow on my hobbies.

>When i arrived at SGI short before world champs 2002 i was very afraid
>for the speed of the R14000. Knowing it is a revised R12000 processor.
>R12000 originally is a mips processor but the R14000 isn't redesigned
>by mips at all but by DEC or NEC) i feared the worst when looking
>at its small L1 cache. But it performed ok. Slightly faster than K7.
>Perhaps 0.5% at initial testing (but that was a 32 versus 32 bits
>test, like all the above tests are; nowhere i took in the tests
>advantage of the 64 bits which i'll do for world champs 2003 though
>at the McKinley.
>
>So i was very amazed by the R14000 when i actually ran first tests.
>I am not so impressed by the mips pro compiler however. Much more
>impressed by itanium1 compiler on the other hand. a DEFAULT -O2
>compile performing that well also at the McKinley. It's *incredible*.

Do you know why Itanium's compiler is so awesome?

It's a VLIW instruction set. Intel calls it EPIC just because they want to be
anal. Anyway, the architecture has a lot of tidbits to overcome problems that
computing has seen in the past. They have "predicate" registers that can
conditionally no-op an instruction (read: fast min/max/abs/other small
conditionals). They have a dedicated loop counter for prediction of loops. They
have 128 integer, floating-point, and vector ("media") registers. They have a
real hardware stack. The list goes on...

>I needed to test a long time before i had figured out which options
>all hurted me in the mipspro compiler. I find it in general a very
>bad thing if -O3 at a compiler runs you 5% slower than -O2.
>
>What i miss or didn't figure out yet is how i can use profile info
>for the mipspro to get my exe faster.
>
>but to save you another hour of stories and compiler horrors,
>it can't hide that the mckinley is a big winner cpu!
>
>>1.33 GHz Athlon was slower than a 2.4 GHz Pentium 4 (which was equal to 1.6 GHz
>>Athlon you said), and an 800 MHz Itanium has performed as well as a 3 GHz
>>Pentium 4 for other people.
>
>I get really the impression you don't see the difference between an
>itanium2=mckinley=800Mhz,1Ghz (supposed to get released at
>1.2Ghz too) and itanium1=800Mhz or slower
>
>>>Now that's *without* branch optimization yet for the McKinley. I don't know
>>>what speedup that will get for the mckinley but i compiled it for the
>>>itanium1 do not forget that. i didn't optimize for itanium2 at all.
>>
>>Itanium 2 isn't out yet. McKinley is still Itanium 1. It's like the difference
>>between Thunderbird and AthlonXP.
>
>then it's time to buy intel stocks for you, that itanium2 is going to
>kill away all other supercomputer chip manufacturers!
>
>>>Nalimov probably has more details regarding this.
>>>
>>>It is clear to me at least that this itanium2 is a big winner.
>>
>>I would hope so. Ever since I read about the IA-64 architecture it was obvious
>>to me that Intel engineers put tremendous thought into its design.
>
>I do not know whether McKinley is 0.13 or 0.18 from head, nor do i know
>how high one can clock a 64 bits chip anyway, but to me it's clear that
>if anyone can clock a supercomputer chip above 2Ghz *ever* , then it's
>going to be intel for sure.
>
>If they ever manage to clock an itanium3, even with level2 and 3 caches
>running at 1/2 of 1/3 speed, at like 2Ghz, and with a better form of
>HT/SMT than the P4 has, then it's going to wipe away anything on
>this planet.

Um...back in the Celeron-1 days, a Celeron-1 was a big winner over a Pentium 2
because the cache, though smaller, ran at processor speed. Intel would lose BIG
if they tried to clock down the caches and make them slower.

A processor that gets more IPC and has a larger data word will be more memory
starved than other 1 GHz processors. Therefore it is even more necessary to have
adequate cache.

>But imagine what a 512Ghz McKinley supercomputer means actually.
>
>What it *represents*. Not taking into account the massive hashtables
>(because each year bigger RAM sizes get available too).

It represents fantasy. I heard that Itanium was intended to clock to 13 GHz, but
I have trouble even believing that.

>But assuming that you have a 2.6ghz K7 world champs 2003,
>just from hardware viewpoint getting to that 512Ghz means you miss
>a factor of 256 exactly (just assuming the data i have now that
>1.3Ghz K7 == 1.0Ghz McKinley; though mckinley will be definitely
>get a lot faster when DIEP is tuned for it).
>
>2^8 = 256.
>
>That means 2 * 8 = 16 years.
>
>16 years is a lot!!

??

I don't follow. At all.

>Of course software quality will get better in 10 years of time.
>Branching factor will get better. Other things will get discovered
>too (but no major search things. *no way*. perhaps more efficient
>combinations of nullmove with hashtables).
>
>So speed matters!
>
>>>>particularly with complex instructions. Every CPU since then had adhered to
>>>>superscalar designs.
>>>
>>>>The Pentium 4 is no different. It has an extremely long pipeline to enable it to
>>>>clock to higher frequencies.
>>>
>>>Exactly what i fear yes.
>>
>>Why fear it? It's a natural progression. Branch prediction was -buggy- on a
>>Pentium, but it didn't matter because a mispredict didn't carry a huge penalty
>>with it. The Pentium can actually mispredict instructions that DON'T BRANCH.
>
>>Next was the P6 core, the PPro/Pentium 2/Pentium 3. Deeper pipeline, but they
>>had reordering and other stuff. Criticized for a deep pipeline. Athlon was
>>deeper than the K6. Now they release the Pentium 4 and Intel gets scathed for a
>>deep pipeline yet again. Why does it matter? They're still fast in most code
>>because most code doesn't branch mispredict.
>
>you forget a crucial aspect. the P3 and the K7 practical could execute more
>instructions a clock than the previous generations of them could.
>
>P3 was 20% faster than P2. P2 was slower than pentium pro, but basically
>same core (L2 cache was clocked down and BTB bigger and L1 cache
>bigger). Each newer generation was faster.
>
>NOT WITH THE P4 !!!!!!!!!
>
>the p4 is SLOWER.

clock-for-clock. Again, "Duh."

>I see it as a big marketing success. Selling something slower for more
>money and selling it as faster because it has a bigger number on it
>(1.7Ghz instead of k7 had 1.2ghz to 1.4ghz).

IPS = IPC * clockrate

Are you trying to tell me that a 3.06 GHz Pentium 4 is slower than a 1.4 GHz
Pentium 3? If you say yes, after I stop laughing I'll hunt people down with both
chips and get some data.
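
Put another way, the same equation gives a break-even point. The sketch below only rearranges IPS = IPC * clockrate for the two clock speeds named above; it says nothing about what IPC either chip actually achieves, and the function name is hypothetical.

```python
def break_even_ipc_ratio(slow_clock_ghz: float, fast_clock_ghz: float) -> float:
    """Fraction of the slower-clocked chip's IPC that the faster-clocked chip
    must sustain for both to deliver the same instructions per second."""
    return slow_clock_ghz / fast_clock_ghz


ratio = break_even_ipc_ratio(slow_clock_ghz=1.4, fast_clock_ghz=3.06)
print(f"{ratio:.3f}")  # 0.458
```

So the 3.06 GHz part only loses if its average IPC falls below roughly 46% of what the 1.4 GHz part gets on the same code.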

Low-end P4s are slower than a P3. Intel marketing has obviously jumped on
consumer ignorance. How does Intel marketing make the processor slower? Did
they bore the Pentium 4 into "sleep mode" with random nonsensical bullshit?

>So P4 is the actual confirmation again that the average person is very
>dumb and goes for something with a bigger number on it.

Oh, I get it. Being in the average consumer PC makes it slower.

>I do not know whether it is possible to check at what speed a processor
>runs internally. If not, then produce a processor called X8. Claim it
>runs 4Ghz and you'll sell *very* well.

On x86, yes. I have written the code that does it many times. On boot up, the
first thing my OS does is enumerate all processors and compute their clock
frequencies.

The funny thing about Athlon is that it decides what its name is based on the
clock frequency it runs at. I was bored one day and fiddling with my bus speed,
and my AthlonMP 1600 chips suddenly decided they were AthlonMP 1800 chips. This
is read directly out of the processor. I know because I wrote the software.

>Of course also have your own compiler team to let it score well on
>specint and let them crosspost to all nerds world wide that it's
>2 times faster than any other processor because of a feature called
>superspeed which executes a program a lot faster when it needs to
>get executed faster than the other concurrent programs running.

So what's preventing AMD from releasing their own compiler?
Oh, that's right. It's just one more thing that Intel has been doing for years
that AMD doesn't.

>that X8 will be sold *very* well then.
>
>>I fixed some of my branches simply by changing the direction they jumped. A lot
>>of it is as simple as that. If I mispredict once in a loop of 1,000 iterations
>>(each branching), what's the difference?
>
>here is the diep problem (but also other chessprograms have this):
> if( general pattern ) {
>  if( pattern )
>    then evaluate
>
>  if( pattern )
>    then evaluate
>
>  if( pattern )
>    then evaluate
>
>   ..
> }
>
>You see the problem is that each pattern is build up from simple elements.
>So if the general pattern is taken then it will get a lot of
>mispredictions in the patterns that it tries.
>
>Let's estimate it at 30% of the patterns.
>
>That's still a horrible amount of mispredictions as you get a lot of
>them within a short period of time.

Um...that's not how branch prediction works. They have default predictions
based on branch direction, and after they see a branch, it goes in a little
table. When they re-encounter the branch, they use the table data to predict
where it's going to branch.

Branch prediction algorithms are pretty good, too. Some of them can even decide
that, since you took the first branch, you won't take any of the others.
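
As a rough illustration of the table idea, here is a sketch of a textbook 2-bit saturating counter for a single branch. This is an assumption chosen for simplicity, not the actual predictor in a P4 or K7, which also use branch history. The starting state of 2 stands in for a static "predict taken" default.

```python
def count_mispredicts(outcomes, state=2):
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken.
    Returns how many outcomes were predicted wrongly."""
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


# A loop branch: taken 999 times, then falls through once at loop exit.
print(count_mispredicts([True] * 999 + [False]))  # 1

# A branch that alternates every time defeats this simple counter half the time.
print(count_mispredicts([True, False] * 500))     # 500
```

The loop case costs a single mispredict in 1,000 iterations, while a branch with no stable bias is where a counter this simple struggles; predictors that track history patterns exist to handle that second case.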

>It would be a lie to say that diep's speed profile is much different
>from other programs. More or less all the chessprograms tend to get
>the same problems.

You mean other chess programs. Weren't we talking about HT for the general
application? I can name a lot of general applications that Diep looks nothing
like.

>If a new processor gets released (32 bits) and single cpu crafty
>is 20% slower at it than at a K7, then i know in advance that it is
>very likely that i am 20% slower at it than at a k7.
>
>Only parallel and at 64 bits things change of course. Crafty *flies*
>at the McKinley.
>
>>>> The bulk of this pipeline is shared for each
>>>>"logical" CPU. They share caches, execution units, decoders, etc. The only thing
>>>>that gets duplicated is the register set, a smaller part of the CPU.
>>>>
>>>>>>>First conclusion is that the system is profitting only from HT when you
>>>>>>>use 4 processes at the same time, OTHERWISE IT IS A DISADVANTAGE IF
>>>>>>>YOU MULTITHREAD, because see the big difference between 2 processes
>>>>>>>running with HT turned on and off.
>>>>>>>
>>>>>>>In itself when you have a program with just 2 threads which you
>>>>>>>run on a dual it gets slower. My assumption is that the hardware reports
>>>>>>>4 cpu's and that the software doesn't care at what cpu to schedule
>>>>>>>the processes/threads. the result of that is that there is a 33% chance
>>>>>>>that things get scheduled at a cpu which is already running a thread/process.
>>>>>>>
>>>>>>>Resulting in a system where 1 cpu idles kind of shortly and 1 cpu is running
>>>>>>>2 threads/processes.
>>>>>>>
>>>>>>>Actually the actual chance that the 2 processes are scheduled at
>>>>>>>2 different processors (there is 4 processors for the OS
>>>>>>>times 3 processors left for the second process is 12 different
>>>>>>>schedulings) is: 8/12 = 2/3 = 66%. In short there is a disaster possibility
>>>>>>>of 33%.
>>>>>>
>>>>>>Yes, when one thread is scheduled on one processor, there are 3 choices for the
>>>>>>other thread, and one is disaster. 1/3 = 33%.
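
The 1/3 figure can be checked by brute-force enumeration. This sketch assumes what the quoted text assumes: the OS sees 4 logical CPUs and places 2 threads on distinct logical CPUs uniformly at random, with no awareness of which logical CPUs share a physical package. The CPU numbering is made up for the example.

```python
from itertools import permutations

# Logical CPUs 0 and 1 share physical package 0; 2 and 3 share package 1.
PHYSICAL = {0: 0, 1: 0, 2: 1, 3: 1}

# Ordered placements: (logical CPU of thread A, logical CPU of thread B).
placements = list(permutations(PHYSICAL, 2))
collisions = [p for p in placements if PHYSICAL[p[0]] == PHYSICAL[p[1]]]

print(len(placements), len(collisions))  # 12 4
```

Four of the twelve placements put both threads on one physical processor, which matches the 8/12 good cases and the 1/3 "disaster" chance quoted above. A scheduler that knows about HT siblings would not behave this way.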
>>>>>
>>>>>>>Now the absolute speed from performance viewpoint. If the system idles
>>>>>>>completely and then starts to run *exclusively* diep at 4 processors, then
>>>>>>>the measured speedup as you can calculate is in the order of 11.4% for
>>>>>>>SMT/HT.
>>>>>>>
>>>>>>>That's not so much actually. The loss by searching parallel is at most
>>>>>>>parallel applications bigger than the win of 11.4%. In case of DIEP
>>>>>>>i am on the lucky side and go for that 11.4% faster speed.
>>>>>>>
>>>>>>>Yet the sad confirmation is that the pessimistic expectation about the
>>>>>>>absolute speed is completely confirmed. This system performs (assuming
>>>>>>>linear scaling) like a 1.98 Ghz dual K7.
>>>>>>
>>>>>>If memory is a big issue for Diep, it probably won't scale linearly as memory
>>>>>>never does.
>>>>>
>>>>>It's a bigger issue for crafty than for DIEP. I hope you realize that
>>>>>this diep version is from 25 august 2002, that beta version runs pretty ok
>>>>>at cc-NUMA machines as well.
>>>>>
>>>>>Crafty doesn't though.
>>>>>
>>>>>>>there are motherboards now which do not require registered memory and
>>>>>>>the K7 runs already quite a while at 2.0Ghz in fact. Now i don't care
>>>>>>>for XP at all here nor do i care for the P4 at all. I just care for
>>>>>>>parallel search here.
>>>>>>>
>>>>>>>If we know that a 2.0Ghz dual K7 is identical to a dual 2.8Ghz Xeon
>>>>>>>and that in the majority of cases the K7 is going to win, then considering
>>>>>>>the huge price difference, the choice would be trivial for most who
>>>>>>>are looking for a lot of computing power for little money.
>>>>>>
>>>>>>AMD has always been better price/performance. Before the huge price differences
>>>>>>in AMD and Intel chips, the AMD chips meant your old Socket 7 board could be
>>>>>>used through ~500 MHz.
>>>>>
>>>>>>>Doesn't take away the fact that the P4 is winning ground. I remember
>>>>>>>the first dual AMD 1.2ghz test versus P4 dual 1.7Ghz and the AMD dual
>>>>>>>being 20% faster. Meaning in short that the speed of a P4 was performing
>>>>>>>about 1 : 1.7
>>>>>>>
>>>>>>>Now if i compare a dual Xeon 2.8Ghz with a 2Ghz K7 then it's equal
>>>>>>>meaning the P4 is performing 1 : 1.4
>>>>>>>
>>>>>>>So that's a big step forward!
>>>>>>
>>>>>>Well just about every application saw a similar gain going from the 256 KB cache
>>>>>>Willamette to the 512 KB cache Northwood. The new Xeons, as I understand,
>>>>>>have 1 MB L3 cache in -addition- to the other caches. Don't quote me there. All
>>>>>>I know is that things changed. The extra cache makes the P4 competitive whereas
>>>>>
>>>>>It's the DDR ram that speeded DIEP and crafty up a lot. Not the bigger
>>>>>cache so much.
>>>>>
>>>>>DDR ram has nearly 2 times faster latency than RDRAM.
>>>>
>>>>You seem so sure, but you never tested a Northwood on RDRAM or a Willamette on
>>>>DDR SDRAM to know.
>>>
>>>I tested many P4s with RDRAM and they were very slow.
>>>
>>>It is trivial that if the latency is 2 times slower that this has
>>>a major impact onto things like hashtables (assuming the same
>>>processor gets put in the machine, because obviously a cpu matters
>>>way more than a bit faster ram).
>>>
>>>It's like creating an obstacle.
>>>
>>>there is eval hashtable. there is transposition hashtable. there is
>>>pawn hashtables. etcetera in diep.
>>>
>>>that doesn't fit *ever* in L2 cache at all.
>>>
>>>A big L2 cache matters basically when you start getting parallel.
>>>
>>>For a single cpu speed the L2 cache matters nothing at all.
>>>
>>>The initial tests that i did rdram versus ddr ram indicated 1.7 versus 1.5
>>>
>>>with SMT that's 1.4 now.
>>>
>>>Best regards,
>>>Vincent
>>
>>L2 matters a lot, just not in your hash probes. A hash probe is subject to
>>latency. Not everything is.
>
>L2 matters a lot except if you already have a lot of it. I won't
>say that 256KB L2 cache + 128KB L1 cache = 384KB cache is enough,
>but it is a good step in the right direction.
>
>Increasing from the K7 the L2 cache to 512KB won't matter that much.
>Simple as that.
>
>Of course dual it will matter a lot more, but still it will be
>minor for diep compared to changing the K7s cpu clock speed.

Mm'hmm, and what about for general applications? Are you talking about chess
programs or general applications?

It is one thing to say that large L2 cache is useless for chess programs. It is
another to say that it's useless for all programs.

The same thing goes for HT.

>I remember how they had clocked down the L2 cache from a K7 running
>at 1Ghz. AMd wanted to be the first to hit 1Ghz and they DID it.
>
>They managed to clock something to 1Ghz. If i remember well december 2000
>it was in the shops in USA.
>
>Despite that the entire L2 cache was clocked down to 1/3 speed or something
>it was still very fast cpu for me!
>
>L2 cache can make you or break you, but when there is a good L2 cache
>and it's big then it's trivial that an even bigger L2 cache won't
>increase performance that much.

Willamette had 256 KB L2 cache. Northwood has 512 KB L2 cache. The difference?
Huge. Even Diep saw gains, as you have posted.

>>It is probably very tempting to make the jump from "HT performance is x% in
>>chess engines" to "HT performance is x% in applications." You can't say that.
>
>You forget a crucial thing. The only applications i run that eat system
>time are usual chess programs. So for me that conclusion is justified, just
>like for majority here.

That doesn't matter. You can't claim it's true for all applications. To
criticize Intel for HT is to say that HT is useless in all applications. Say
what you mean.

>>Chess engines are not representative of all applications. Neither your memory
>>accesses nor access patterns are representative of typical software.
>
>It is not a coincidence that for HT i come down to 11.4% and that the
>database guys came to 11% for it.
>
>Claims of 20% from intel initially and now some of their spokesman crying
>something about 30% which simply isn't true for majority of applications,
>is another good show of how a few voices which exaggerate just too much
>can let something look better than it is.
>
>11.4% is not worth it simply.

The claim is 30-40% from Intel. It was 30-40% a long time ago.

I still fail to see how Diep is representative of all software.

I'm also curious as I remember you telling me that people spend months to get a
1-2% speedup. Now you're telling me that 11.4% isn't worth it?

>No it isn't a $1 investment of intel. They can produce only half the cpu's
>because the size of the P4 is so big. Now i know why it's so big. there's
>2 cpu's at each chip instead of 1 :)

Um...no. There are twice the registers. Everything else gets shared. P4 prior to
HT was 55M transistors. P4 with HT is not 110M transistors. That's ridiculous. I
can't find an exact count, but I really doubt it's more than 57M or 58M
transistors with HT.

>>I still haven't seen conclusive data, either. You need to run controlled tests.
>>You can't just change a bunch of things and say, "Oh, it was that one that
>>caused this effect."
>
>i am not talking about light effects here nor am i talking about radiation
>effects on peoples health.
>
>Processor speeds you can measure very easily. This has nothing to do
>with convincing people. But with objective measuring. Objective measured
>i win 11.4% with SMT/HT at the 2.8Ghz Xeon. Other big tests indicated
>11% as you can read at nearly any site (don't go to intel that's the
>worst source of information of course). Many of those testers are even
>intel friendly. But they tested objective (some got amazingly quick a
>HT enabled cpu of them. all p4s i tested some months ago
>didn't have it working well at all!) and came down to less than 20%.

Ah, ok. All those years I spent in school brainwashed me to the truth. I can
make up bullshit if I run 2 tests. I was under the impression that data was
volatile and difficult to measure, and I was also under the impression that you
had to change one thing at a time to understand what was going on. Silly me.

I have a program that measures instruction timings. I get variance between
tests. This makes me extremely upset because it means something is wrong. I have
spent over a month tweaking the timing code to fix this variance. I haven't
figured out how to fix the systematic bias, either.

By this logic, my choice of action should be to run 1 test and say it's
conclusive.
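
For what it's worth, the kind of repeated measurement meant here can be sketched in a few lines. This is not the instruction-timing program mentioned above; it is a generic wall-clock harness, the workload is a placeholder, and the helper names are invented.

```python
import statistics
import time


def time_runs(fn, runs=10):
    """Run fn several times and return the elapsed wall-clock seconds of each run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Report the spread, not just a single number."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples),
    }


# Placeholder workload; substitute the benchmark under test.
samples = time_runs(lambda: sum(range(100_000)))
print(summarize(samples))
```

If the standard deviation is a noticeable fraction of the median, a difference of a few percent between two configurations cannot be read off a single run of each.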

>For diep at a single cpu P4 3.06ghz someone measured it around 18%
>when running it 2 processes there.
>
>If at first SMT doesn't give a speedup at all, then it gives at
>new 2.8Ghz Xeons a speedup of 11.6 and then it gives 18% at 3.06Ghz
>brandnew P4, then obviously it's a technique that gets slowly improved.

Actually they're the same chips.

>the right conclusion for SMT/HT is very clearly that it's less than 20%
>speedup. That i get 11.4% with a 25 august beta version and that someone
>else gets 11.0% and some other tester reports 18% speedup, it doesn't
>matter that much.

Clearly.

I'm going to take a poll here on the forums. I'm going to ask Eugene, myself,
Dr. Hyatt, and anyone else I can find who likes HT. I'm going to ask, "Is HT any
good?" We're all going to say yes. Then I'm going to come back here and post
that 100% of the people on this forum like HT.

>What matters is the actual total speed of such a machine. The total speed
>was measured very clearly for 4 processes after many minutes to be
>something which a K7 dual 2.0Ghz gets nearly instantly.

I posted results from Crafty on both of my dual-Athlons. Dr. Hyatt posted
results from his dual-Xeon 2.8 GHz with and without HT. My AthlonMP 2000 at work
still wasn't on par with his Xeons without HT.

No.

I like AMD. The truth is that AMD is not always fastest. The K8 will have a
decisive advantage over the Pentium 4. I'm waiting for the K8, but in the
meantime, I'm not going to go around making silly claims.

>If we compare that, then it's trivial that SMT has a long way to go
>before it is grown up.
>
>Most likely because of that it wasn't integrated in McKinley yet. They
>couldn't afford at a serious cpu to do something that only makes the
>kids happy!
>
>>-Matt

Most likely it's because it won't help Itanium much. It's dubious whether HT
will help IA-64 at all. To realize why, you have to understand IA-64 and
understand HT.

The goal of HT is to fill unused execution units.
The goal of IA-64 is to minimize the unused execution units at compile-time so
the extra on-chip logic can be devoted to doing useful work.

See why it doesn't work?
The fact that IA-64 is VLIW would complicate the HT logic, and it breaks away
from the VLIW paradigm of computing.
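
A toy model of the argument, under loud assumptions: a core issues up to `issue_width` operations per cycle, a single thread fills some average number of those slots, and HT can at best let a second thread fill the rest. The slot counts below are illustrative guesses, not measurements of any P4 or Itanium.

```python
def smt_throughput_gain(slots_used_a: float, slots_used_b: float, issue_width: float) -> float:
    """Upper bound on total throughput with two threads sharing the issue slots,
    relative to thread A running alone."""
    combined = min(issue_width, slots_used_a + slots_used_b)
    return combined / slots_used_a


# Hypothetical 3-wide core where one thread fills only 1 slot per cycle on average.
print(smt_throughput_gain(1.0, 1.0, issue_width=3))  # 2.0

# Hypothetical 6-wide core whose compiler already fills 5 of 6 slots.
print(smt_throughput_gain(5.0, 5.0, issue_width=6))  # 1.2
```

The better the compiler fills the slots up front, the less there is left for a second thread to pick up, which is the point being made about IA-64. Real gains would be lower than this bound because the threads also compete for caches and other shared resources.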

-Matt



Last modified: Thu, 15 Apr 21 08:11:13 -0700
