Author: Vincent Diepeveen
Date: 18:47:22 12/17/02
On December 17, 2002 at 13:14:56, Matt Taylor wrote:
>On December 17, 2002 at 12:30:40, Vincent Diepeveen wrote:
>
>>On December 17, 2002 at 11:59:48, Matt Taylor wrote:
>>
>>>On December 17, 2002 at 11:33:36, Vincent Diepeveen wrote:
>>>
>>>>On December 17, 2002 at 11:27:18, Matt Taylor wrote:
>>>>
>>>>>On December 17, 2002 at 10:10:46, Vincent Diepeveen wrote:
>>>>>
>>>>>>Hello,
>>>>>>
>>>>>>Some tests were performed in the USA, where some P4 Xeon dual 2.8Ghz
>>>>>>systems get delivered now. In Europe we can't get them yet and
>>>>>>most likely we don't want them either:
>>>>>>
>>>>>>Here are the results of DIEP at the Xeon 2.8Ghz dual ECC registered DDR ram.
>>>>>>
>>>>>>test 1: diep 4 processes. Of course HT enabled.
>>>>>> 181538 nps
>>>>>>
>>>>>>test 2: diep 2 processes. HT enabled.
>>>>>> 135924 nps
>>>>>>
>>>>>>test 3: diep 2 processes K7 1.6ghz (registered DDR ram all other settings
>>>>>> identical to xeon dual setup):
>>>>>> 146555
>>>>>>
>>>>>>THE 2 TESTS NALIMOV DIDN'T OR COULDN'T WANT TO DO WITH CRAFTY
>>>>>>SOME WEEKS AGO REVEAL A BIG WEAKNESS OF HT/SMT:
>>>>>>
>>>>>>test 4: diep 2 processes. HT disabled. 171288 nps
>>>>>>
>>>>>>test 5 and 6: diep single cpu HT disabled and enabled were same speed
>>>>>> 92090 nps versus 92019 nps.
>>>>>
>>>>>Crafty gets better results with HT, but it's been optimized for HT. It just
>>>>
>>>>That hasn't been proven yet.
>>>>
>>>>there was no test done without HT and 2 processors as far as i know.
>>>>
>>>>Please read how i tested it.
>>>
>>>I'm pretty sure he did non-HT tests too.
>>>
>>>>>means you need a personal Intel engineer to make it blazing fast for people who
>>>>>plopped down $600 USD for a top-of-the-line Intel chip. Before long they'll
>>>>>start selling Intel engineers in local computer shops. Collect all 18...
>>>>
>>>>Crafty is doing 2 probes in 2 hashtables for example. Remove it and
>>>>improve it to 4 probes at 1 table (which is faster on both intel and
>>>>AMD anyway, but AMD profits more because its chipset is cheaper).
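A rough sketch of what "4 probes at 1 table" looks like. This is not Crafty's or DIEP's code: the entry layout, the table size and the replacement rule (overwrite the shallowest of the 4) are assumptions for illustration only. The point is that the 4 entries sit next to each other, so in a C implementation they can share a cache line and cost about one trip to memory.

```python
BUCKET = 4                  # probes per lookup, all adjacent in memory
NUM_BUCKETS = 1 << 16       # ASSUMPTION: table size, a power of 2

# each slot holds a (key, depth, score) tuple or None; layout is hypothetical
table = [None] * (NUM_BUCKETS * BUCKET)

def probe(key):
    """Look at the 4 adjacent slots of the bucket this key maps to."""
    base = (key & (NUM_BUCKETS - 1)) * BUCKET
    for slot in table[base:base + BUCKET]:
        if slot is not None and slot[0] == key:
            return slot
    return None

def store(key, depth, score):
    """Reuse the slot with the same key, else an empty one, else the shallowest."""
    base = (key & (NUM_BUCKETS - 1)) * BUCKET
    victim = base
    for i in range(base, base + BUCKET):
        slot = table[i]
        if slot is None or slot[0] == key:
            victim = i
            break
        if slot[1] < table[victim][1]:
            victim = i
    table[victim] = (key, depth, score)
```

With 2 probes in 2 separate tables the two lookups land in unrelated places in memory, which is where the extra latency comes from.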
>>>>
>>>>>HT is a good idea, and it works in practice rather than just on paper. It just
>>>>>doesn't work for -everything-.
>>>>
>>>>in the factory they press 2 cpu's and put a single P4 sticker on it.
>>>>You pay a factor 2 more, but get something 11.4% faster. For databases
>>>>it was measured 11% rather than 11.4%.
>>>>
>>>>That's what i call a bad buy!
>>>
>>>CPUs since the Pentium have been pipelined. The goal is to spread the work out
>>>so you can get a throughput of at least 1 op/cycle. Not always possible,
>>
>>1 op/cycle is very bad.
>>
>>Already when the pentiumpro 200 existed the average C program ran at
>>1.76 instructions a clock.
>>
>>So 1 iop/cycle is very bad then.
>>
>>Considering that a K7 when calculated back to the pentiumpro200 is hell
>>of a lot faster for each processor clock, then you will realize clearly
>>that it's not worse than 1.76 now.
>>
>>Note that the 1.76 number wasn't measured by me. I have no idea how i
>>measure this for DIEP otherwise i would know.
>
>The Pentium Pro executed an average of 1.76 micro-ops/cycle. There is a big
>difference between an average speed and an instantaneous speed. If I travel 10
>km/h for the duration of 1 hour and then remain still for 9 hours, I had an
>average speed of 1 km/h. I think the difference here is obvious enough that it
>warrants no further discussion.
>>I see however that processors like McKinley which can do more instructions
>>a clock (6 a bundle) that it achieves way faster speeds for DIEP than
>>any other processor. No being 64 bits has nothign to do with that.
>Itanium is designed for high IPC. They don't have to ramp the clock speed to
>make it fast. It is silly to criticize the Pentium 4 for low IPC when it is
>-designed- for it. It's like criticizing a Geo Metro for being slow. Most people
>will scratch their heads for a second, look at you funny, and then say, "Duh?"
No one will agree with you here. For the clock speed they can reach,
it is crucial what IPC you get. Simple as that. If you get to
2.8 GHz and get 0.00000000000001 IPC then you can't say "that's the goal".
You want to be competitive, so you want to execute stuff faster.
How fast you execute a program is a combination of how many
GHz of horsepower you throw at it and the IPC you get.
>In the case of Pentium 4, it seems paradoxial to achieve greater speeds with
>lower IPC, but it is the strategy Intel picked. It is useless to criticize
>further without actual facts (such as the fact that Pentium 4 is intended to hit
>5 GHz).
Fact: cheap K7 beats P4 for DIEP.
Proven over and over again. Even the latest release of the Xeon at 2.8 GHz
is getting kicked by a 2 GHz MP in the case of DIEP. That's *real bad* for
the P4 if you consider it's an x86 processor.

Note that shouting 5 GHz for a 0.13 micron design is pretty weird.
How to ever get above 3.5 GHz with the 0.13 micron P4 with air cooling?
I have to see that first before i believe it!

At 3.06 GHz the P4 is very close to its end. The K7 will be clocked
up to 3 GHz too of course. It seems AMD just needs a bit more time than
intel to get new technology to work.
>>In contradiction to move a 64 bits word at the McKinley requires more
>>power than moving a 32 bits word at a K7.
>>
>>Diep is a 32 bits program so if i compile 'int' then if the compiler makes
>>that a 64 bits instruction for the mckinley that's not my worry.
>>
>>My happy feeling is the speed of it and it is doing very well. it is 33%
>>faster (a bad compiled executable with a cross compiler) for each Ghz of
>>a K7. 1.33Ghz K7 == 1.0 Ghz McKinley.
>
>That's not very good. The 1 GHz Itanium is the upper-end Itanium right now. Your
That's *world conquering* in fact. No other cpu gets even *close* to
the k7. Here is my list of performance for DIEP per GHz,
starting with the worst:
Alpha 21164
Itanium1
P4
K6
P2
UltrasparcII
P6 pentiumpro
P3 coppermine (20% faster than P2)
IBM Power (latest one, sorry i always forget whether it's power3 or 4 or 5)
Alpha 21264
Sun Ultrasparc IIIcu
AMD K7 (above 1.2Ghz; not the old 1ghz K7s which performed worse)
R14000 (of course it is clocked 500Mhz which in 1999 was great)
McKinley/Itanium2
Do not forget the HUGE jump McKinley made after the
horrible introduction of the itanium1. A McKinley is exactly THREE
times faster than an itanium1 (measured at the same clock of course,
800 MHz versus 800 MHz).

McKinley 1 GHz costs about $7000 apiece. Very cheap for the performance
a single 1 GHz processor delivers. Don't hesitate to call SGI if you want
to order a supercomputer with McKinleys inside. They can help you further.
It kicks any other cc-NUMA system!

Just look at the bandwidth these systems deliver, you'll faint!
The current TERAS already delivers 1 terabyte a second, what more do i
need to say?
Also all those supercomputer cpu's are clocked around 1Ghz. Only IBMs
inferior thing is clocked to 1.3ghz but that 1.3Ghz doesn't make up
for it much.
Do not forget that the actual speed of the mckinley which i posted
here some time ago is based upon an old test with a cross compiler
which was very old and meant to run only for itanium1.
When i compile native itanium2 and do some branch optimizations
by using profile info, then it is a lot faster i bet!
Now if i may choose between 500 GHz of McKinley, 500 GHz of Ultra (note
there are no SUN machines with that many processors which you can, with
some luck, call cc-NUMA) or a 32 processor IBM Power (a single calculation
unit of 32 1.3 GHz IBM Power processors is of course inferior to that
500 GHz), just compare. What is going to beat 500 GHz of McKinley for DIEP?
*nothing*.
Only about the SUN processors i can say that they are underrated in
the computerchess world. Bob is always saying they suck, but they do not.
They are fine processors. But not *close* to McKinley performance!
IBM is just cheating too much at the specint tests, that McKinley is
really kicking butt there.
*nothing* kicks it. really nothing.
Even for those who want to do vector processing it is still performing
pretty ok with 6 instructions a bundle. Cray doing like 29 or whatever
but that Cray is very expensive for each cpuhour and you can get like
10 McKinleys for each cray processor or so *easily* within a single
partition.
Then just $7000 a piece. This is a buy!!
When i arrived at SGI shortly before world champs 2002 i was very afraid
for the speed of the R14000, knowing it is a revised R12000 processor
(the R12000 originally is a mips processor, but the R14000 wasn't redesigned
by mips at all but by DEC or NEC). I feared the worst when looking
at its small L1 cache. But it performed ok. Slightly faster than K7.
Perhaps 0.5% at initial testing (but that was a 32 versus 32 bits
test, like all the above tests are; nowhere in the tests did i take
advantage of the 64 bits, which i'll do for world champs 2003 though
at the McKinley).
So i was very amazed by the R14000 when i actually ran first tests.
I am not so impressed by the mips pro compiler however. Much more
impressed by itanium1 compiler on the other hand. a DEFAULT -O2
compile performing that well also at the McKinley. It's *incredible*.
I needed to test a long time before i had figured out which options
all hurt me in the mipspro compiler. I find it in general a very
bad thing if -O3 at a compiler runs you 5% slower than -O2.
What i miss or didn't figure out yet is how i can use profile info
with mipspro to get my exe faster.
but to save you another hour of stories and compiler horrors,
it can't hide that the mckinley is a big winner cpu!
>1.33 GHz Athlon was slower than a 2.4 GHz Pentium 4 (which was equal to 1.6 GHz
>Athlon you said), and an 800 MHz Itanium has performed as well as a 3 GHz
>Pentium 4 for other people.
I really get the impression you don't see the difference between an
itanium2 = McKinley = 800 MHz or 1 GHz (supposed to get released at
1.2 GHz too) and an itanium1 = 800 MHz or slower.
>>Now that's *without* branch optimization yet for the McKinley. I don't know
>>what speedup that will get for the mckinley but i compiled it for the
>>itanium1 do not forget that. i didn't optimize for itanium2 at all.
>
>Itanium 2 isn't out yet. McKinley is still Itanium 1. It's like the difference
>between Thunderbird and AthlonXP.
then it's time to buy intel stocks for you, that itanium2 is going to
kill away all other supercomputer chip manufacturers!
>>Nalimov probably has more details regarding this.
>>
>>It is clear to me at least that this itanium2 is a big winner.
>
>I would hope so. Ever since I read about the IA-64 architecture it was obvious
>to me that Intel engineers put tremendous thought into its design.
I do not know off the top of my head whether McKinley is 0.13 or 0.18
micron, nor do i know how high one can clock a 64 bits chip anyway, but to
me it's clear that if anyone can clock a supercomputer chip above 2 GHz
*ever*, then it's going to be intel for sure.
If they ever manage to clock an itanium3, even with level2 and 3 caches
running at 1/2 or 1/3 speed, at like 2 GHz, and with a better form of
HT/SMT than the P4 has, then it's going to wipe away anything on
this planet.
But imagine what a 512Ghz McKinley supercomputer means actually.
What it *represents*. Not taking into account the massive hashtables
(because each year bigger RAM sizes get available too).
But assuming that you have a 2.6 GHz K7 at world champs 2003,
just from a hardware viewpoint getting to that 512 GHz means you miss
a factor of 256 exactly (just assuming the data i have now that
1.3 GHz K7 == 1.0 GHz McKinley; though McKinley will definitely
get a lot faster when DIEP is tuned for it).
512 * 1.3 / 2.6 = 256 = 2^8.
With a doubling of speed every 2 years that means 2 * 8 = 16 years.
16 years is a lot!!
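
The calculation can be redone in a few lines. A minimal sketch: the 512 GHz, 2.6 GHz and 1.3 ratio are the numbers from this post, while the function name and the doubling period of 2 years (which is what the 2 * 8 step implies) are assumptions, not measured data.

```python
import math

def years_to_catch_up(target_ghz, own_ghz, k7_per_mckinley=1.3,
                      years_per_doubling=2.0):
    """Years until one K7-class cpu matches a McKinley machine.

    target_ghz         summed clock of the McKinley machine (512 in the post)
    own_ghz            clock of the K7 (2.6 in the post)
    k7_per_mckinley    K7 GHz worth 1 McKinley GHz for DIEP (1.3 in the post)
    years_per_doubling ASSUMPTION: hardware speed doubles every 2 years
    """
    factor = target_ghz * k7_per_mckinley / own_ghz
    doublings = math.log2(factor)
    return factor, doublings, doublings * years_per_doubling

factor, doublings, years = years_to_catch_up(512, 2.6)
print(f"factor {factor:.0f}, {doublings:.0f} doublings, {years:.0f} years")
```

With the 1.33 ratio mentioned earlier instead of 1.3 the factor comes out a little bigger, but the number of years hardly moves.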
Of course software quality will get better in 10 years of time.
Branching factor will get better. Other things will get discovered
too (but no major search things. *no way*. perhaps more efficient
combinations of nullmove with hashtables).
So speed matters!
>>>particularly with complex instructions. Every CPU since then had adhered to
>>>superscalar designs.
>>
>>>The Pentium 4 is no different. It has an extremely long pipeline to enable it to
>>>clock to higher frequencies.
>>
>>Exactly what i fear yes.
>
>Why fear it? It's a natural progression. Branch prediction was -buggy- on a
>Pentium, but it didn't matter because a mispredict didn't carry a huge penalty
>with it. The Pentium can actually mispredict instructions that DON'T BRANCH.
>Next was the P6 core, the PPro/Pentium 2/Pentium 3. Deeper pipeline, but they
>had reordering and other stuff. Criticized for a deep pipeline. Athlon was
>deeper than the K6. Now they release the Pentium 4 and Intel gets scathed for a
>deep pipeline yet again. Why does it matter? They're still fast in most code
>because most code doesn't branch mispredict.
You forget a crucial aspect. The P3 and the K7 could in practice execute more
instructions a clock than the generations before them.
P3 was 20% faster than P2. P2 was slower than pentium pro, but basically
same core (L2 cache was clocked down and BTB bigger and L1 cache
bigger). Each newer generation was faster.
NOT WITH THE P4 !!!!!!!!!
the p4 is SLOWER.
I see it as a big marketing success. Selling something slower for more
money and selling it as faster because it has a bigger number on it
(1.7 GHz where the K7 had 1.2 GHz to 1.4 GHz).
So P4 is the actual confirmation again that the average person is very
dumb and goes for something with a bigger number on it.
I do not know whether it is possible to check at what speed a processor
runs internally. If not, then produce a processor called X8. Claim it
runs 4Ghz and you'll sell *very* well.
Of course also have your own compiler team to let it score well on
specint and let them crosspost to all nerds world wide that it's
2 times faster than any other processor because of a feature called
superspeed which executes a program a lot faster when it needs to
get executed faster than the other concurrent programs running.
that X8 will be sold *very* well then.
>I fixed some of my branches simply by changing the direction they jumped. A lot
>of it is as simple as that. If I mispredict once in a loop of 1,000 iterations
>(each branching), what's the difference?
here is the diep problem (but also other chessprograms have this):
if( general pattern ) {
  if( pattern )
    then evaluate
  if( pattern )
    then evaluate
  if( pattern )
    then evaluate
  ..
}
You see the problem is that each pattern is built up from simple elements.
So if the general pattern is taken then it will get a lot of
mispredictions in the patterns that it tries.
Let's estimate it at 30% of the patterns.
That's still a horrible amount of mispredictions as you get a lot of
them within a short period of time.
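
To get a feeling for what that costs, here is a rough sketch. Only the 30% estimate comes from this post. The number of patterns and the penalty per mispredict (about 20 clocks for a long pipeline, about 10 for a shorter one) are assumptions picked for illustration, not measurements of DIEP or of any specific cpu.

```python
def mispredict_cycles(num_patterns, mispredict_rate, penalty_cycles):
    """Expected clocks lost to branch mispredictions in one evaluation block."""
    return num_patterns * mispredict_rate * penalty_cycles

PATTERNS = 100   # ASSUMPTION: pattern tests inside one general pattern
RATE = 0.30      # the estimate from this post

# ASSUMED penalties in clocks, only to show how pipeline depth scales the loss
for name, penalty in (("long pipeline", 20), ("short pipeline", 10)):
    lost = mispredict_cycles(PATTERNS, RATE, penalty)
    print(f"{name}: about {lost:.0f} clocks lost per evaluation")
```

Whatever the real numbers are, the loss grows linearly with the penalty, so a pipeline that is twice as deep pays about twice as much for the same evaluation code.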
It would be a lie to say that diep's speed profile is much different
from other programs. More or less all the chessprograms tend to get
the same problems.
If a new processor gets released (32 bits) and single cpu crafty
is 20% slower at it than at a K7, then i know in advance that it is
very likely that i am 20% slower at it than at a k7.
Only parallel and at 64 bits things change of course. Crafty *flies*
at the McKinley.
>>> The bulk of this pipeline is shared for each
>>>"logical" CPU. They share caches, execution units, decoders, etc. The only thing
>>>that gets duplicated is the register set, a smaller part of the CPU.
>>>
>>>>>>First conclusion is that the system is profitting only from HT when you
>>>>>>use 4 processes at the same time, OTHERWISE IT IS A DISADVANTAGE IF
>>>>>>YOU MULTITHREAD, because see the big difference between 2 processes
>>>>>>running with HT turned on and off.
>>>>>>
>>>>>>In itself when you have a program with just 2 threads which you
>>>>>>run on a dual it gets slower. My assumption is that the hardware reports
>>>>>>4 cpu's and that the software doesn't care at what cpu to schedule
>>>>>>the processes/threads. the result of that is that there is a 33% chance
>>>>>>that things get scheduled at a cpu which is already running a thread/process.
>>>>>>
>>>>>>Resulting in a system where 1 cpu idles kind of shortly and 1 cpu is running
>>>>>>2 threads/processes.
>>>>>>
>>>>>>Actually the actual chance that the 2 processes are scheduled at
>>>>>>2 different processors (there is 4 processors for the OS
>>>>>>times 3 processors left for the second process is 12 different
>>>>>>schedulings) is: 8/12 = 2/3 = 66%. In short there is a disaster possibility
>>>>>>of 33%.
>>>>>
>>>>>Yes, when one thread is scheduled on one processor, there are 3 choices for the
>>>>>other thread, and one is disaster. 1/3 = 33%.
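
That 8/12 can be checked by brute force. A minimal sketch, assuming the OS sees 4 logical cpus of which 0,1 sit on one physical cpu and 2,3 on the other, and that the scheduler picks between them without any preference. The numbering and the names are hypothetical.

```python
from itertools import permutations

# ASSUMPTION: logical cpus 0,1 share physical cpu 0; logical 2,3 share physical 1
PHYSICAL_OF = {0: 0, 1: 0, 2: 1, 3: 1}

def placement_odds():
    """Count every way to put 2 processes on 2 distinct logical cpus."""
    placements = list(permutations(PHYSICAL_OF, 2))   # 4 * 3 = 12 schedulings
    good = [p for p in placements
            if PHYSICAL_OF[p[0]] != PHYSICAL_OF[p[1]]]
    return len(good), len(placements)

good, total = placement_odds()
print(f"{good}/{total} good schedulings, disaster chance {1 - good / total:.0%}")
```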
>>>>
>>>>>>Now the absolute speed from performance viewpoint. If the system idles
>>>>>>completely and then starts to run *exclusively* diep at 4 processors, then
>>>>>>the measured speedup as you can calculate is in the order of 11.4% for
>>>>>>SMT/HT.
>>>>>>
>>>>>>That's not so much actually. The loss by searching parallel is at most
>>>>>>parallel applications bigger than the win of 11.4%. In case of DIEP
>>>>>>i am on the lucky side and go for that 11.4% faster speed.
>>>>>>
>>>>>>Yet the sad confirmation is that the pessimistic expectation about the
>>>>>>absolute speed is completely confirmed. This system performs (assuming
>>>>>>lineair scaling) like a 1.98 Ghz dual K7.
>>>>>
>>>>>If memory is a big issue for Diep, it probably won't scale linearly as memory
>>>>>never does.
>>>>
>>>>It's a bigger issue for crafty than for DIEP. I hope you realize that
>>>>this diep version is from 25 august 2002, that beta version runs pretty ok
>>>>at cc-NUMA machines as well.
>>>>
>>>>Crafty doesn't though.
>>>>
>>>>>>there are motherboards now which do not require registered memory and
>>>>>>the K7 runs already quite a while at 2.0Ghz in fact. Now i don't care
>>>>>>for XP at all here nor do i care for the P4 at all. I just care for
>>>>>>parallel search here.
>>>>>>
>>>>>>If we know that a 2.0Ghz dual K7 is identical to a dual 2.8Ghz Xeon
>>>>>>and that in the majority of cases the K7 is going to win, then considering
>>>>>>the huge price difference, the choice would be trivial for most who
>>>>>>are looking for a lot of computing power for little money.
>>>>>
>>>>>AMD has always been better price/performance. Before the huge price differences
>>>>>in AMD and Intel chips, the AMD chips meant your old Socket 7 board could be
>>>>>used through ~500 MHz.
>>>>
>>>>>>Doesn't take away the fact that the P4 is winning ground. I remember
>>>>>>the first dual AMD 1.2ghz test versus P4 dual 1.7Ghz and the AMD dual
>>>>>>being 20% faster. Meaning in short that the speed of a P4 was performing
>>>>>>about 1 : 1.7
>>>>>>
>>>>>>Now if i compare a dual Xeon 2.8Ghz with a 2Ghz K7 then it's equal
>>>>>>meaning the P4 is performing 1 : 1.4
>>>>>>
>>>>>>So that's a big step forward!
>>>>>
>>>>>Well just about every application saw a similar gain from the 512 KB cache
>>>>>Northwood from the 256 KB cache Williamette. The new Xeons, as I understand,
>>>>>have 1 MB L3 cache in -addition- to the other caches. Don't quote me there. All
>>>>>I know is that things changed. The extra cache makes the P4 competitive whereas
>>>>
>>>>It's the DDR ram that speeded DIEP and crafty up a lot. Not the bigger
>>>>cache so much.
>>>>
>>>>DDR ram has nearly 2 times faster latency than RDRAM.
>>>
>>>You seem so sure, but you never tested a Northwood on RDRAM or a Williamette on
>>>DDR SDRAM to know.
>>
>>I tested many P4s with RDRAM and they were very slow.
>>
>>It is trivial that if the latency is 2 times slower that this has
>>a major impact onto things like hashtables (assuming the same
>>processor gets put in the machine, because obviously a cpu matters
>>way more than a bit faster ram).
>>
>>It's like creating an obstacle.
>>
>>there is eval hashtable. there is transposition hashtable. there is
>>pawn hashtables. etcetera in diep.
>>
>>that doesn't fit *ever* in L2 cache at all.
>>
>>A big L2 cache matters basically when you start getting parallel.
>>
>>For a single cpu speed the L2 cache matters nothing at all.
>>
>>The initial tests that i did rdram versus ddr ram indicated 1.7 versus 1.5
>>
>>with SMT that's 1.4 now.
>>
>>Best regards,
>>Vincent
>
>L2 matters a lot, just not in your hash probes. A hash probe is subject to
>latency. Not everything is.
L2 matters a lot except if you already have a lot of it. I won't
say that 256KB L2 cache + 128KB L1 cache = 384KB cache is enough,
but it is a good step in the right direction.
Increasing from the K7 the L2 cache to 512KB won't matter that much.
Simple as that.
Of course dual it will matter a lot more, but still it will be
minor for diep compared to changing the K7s cpu clock speed.
I remember how they had clocked down the L2 cache of a K7 running
at 1 GHz. AMD wanted to be the first to hit 1 GHz and they DID it.
They managed to clock something to 1 GHz. If i remember well it was
in the shops in the USA in december 2000.
Even though the entire L2 cache was clocked down to 1/3 speed or something,
it was still a very fast cpu for me!
L2 cache can make you or break you, but when there is a good L2 cache
and it's big then it's trivial that an even bigger L2 cache won't
increase performance that much.
>It is probably very tempting to make the jump from "HT performance is x% in
>chess engines" to "HT performance is x% in applications." You can't say that.
You forget a crucial thing. The only applications i run that eat system
time are usually chess programs. So for me that conclusion is justified, just
like for the majority here.
>Chess engines are not representative of all applications. Neither your memory
>accesses nor access patterns are representative of typical software.
It is not a coincidence that for HT i come down to 11.4% and that the
database guys came to 11% for it.
Claims of 20% from intel initially, and now some of their spokesmen crying
something about 30%, which simply isn't true for the majority of applications,
are another good show of how a few voices which exaggerate just too much
can make something look better than it is.
11.4% is not worth it simply.
No it isn't a $1 investment of intel. They can produce only half the cpu's
because the size of the P4 is so big. Now i know why it's so big. there's
2 cpu's at each chip instead of 1 :)
>I still haven't seen conclusive data, either. You need to run controlled tests.
>You can't just change a bunch of things and say, "Oh, it was that one that
>caused this effect."
i am not talking about light effects here nor am i talking about radiation
effects on peoples health.
Processor speeds you can measure very easily. This has nothing to do
with convincing people, but with objective measuring. Objectively measured
i win 11.4% with SMT/HT at the 2.8 GHz Xeon. Other big tests indicated
11%, as you can read at nearly any site (don't go to intel, that's the
worst source of information of course). Many of those testers are even
intel friendly. But they tested objectively (some of them got a HT enabled
cpu amazingly quickly; all P4s i tested some months ago
didn't have it working well at all!) and came down to less than 20%.
For diep at a single cpu P4 3.06 GHz someone measured it around 18%
when running 2 processes there.
If at first SMT doesn't give a speedup at all, then it gives at the
new 2.8 GHz Xeons a speedup of 11.4%, and then it gives 18% at the 3.06 GHz
brand new P4, then obviously it's a technique that is slowly getting improved.
the right conclusion for SMT/HT is very clearly that it's less than 20%
speedup. That i get 11.4% with a 25 august beta version and that someone
else gets 11.0% and some other tester reports 18% speedup, it doesn't
matter that much.
What matters is the actual total speed of such a machine. The total speed
was measured very clearly for 4 processes after many minutes to be
something which a K7 dual 2.0Ghz gets nearly instantly.
If we compare that, then it's trivial that SMT has a long way to go
before it is grown up.
Most likely because of that it wasn't integrated in McKinley yet. They
couldn't afford at a serious cpu to do something that only makes the
kids happy!
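
The comparison with a dual K7 can be reproduced from the nps numbers at the top of this thread, assuming (as said there) linear scaling of nps with clock speed. The function name is only for illustration.

```python
def equivalent_k7_ghz(xeon_nps, k7_nps, k7_ghz):
    """Clock a dual K7 would need to match xeon_nps.

    ASSUMPTION: nps scales linearly with the K7 clock.
    """
    return xeon_nps / k7_nps * k7_ghz

# test 1: dual Xeon 2.8 GHz, 4 processes, HT enabled: 181538 nps
# test 3: dual K7 1.6 GHz, 2 processes:               146555 nps
ghz = equivalent_k7_ghz(181538, 146555, 1.6)
print(f"dual Xeon 2.8 GHz with HT is about a dual K7 at {ghz:.2f} GHz")
```

Linear scaling is on the optimistic side for the K7, since memory does not get faster with the clock, so take the result as an estimate.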
>-Matt
Last modified: Thu, 15 Apr 21 08:11:13 -0700