Computer Chess Club Archives


Subject: Re: Hyper Threading and Chess

Author: Matt Taylor

Date: 23:41:59 12/31/02


On January 01, 2003 at 01:44:37, Vincent Diepeveen wrote:

>On January 01, 2003 at 00:21:15, Matt Taylor wrote:
>
>>On December 31, 2002 at 23:08:09, Vincent Diepeveen wrote:
>>
>>>On December 31, 2002 at 13:57:13, Matt Taylor wrote:
>>>
>>>>On December 31, 2002 at 11:49:31, Vincent Diepeveen wrote:
>>>>
>>>>>On December 30, 2002 at 22:32:52, Robert Hyatt wrote:
>>>>>
>>>>>>On December 30, 2002 at 20:29:11, Vincent Diepeveen wrote:
>>>>>>
>>>>>>>On December 30, 2002 at 19:39:23, Frank Koenig wrote:
>>>>>>>
>>>>>>>>Two questions.
>>>>>>>>
>>>>>>>>One) Will Intel's HT technology be able to help chess programs above and beyond
>>>>>>>>just allowing one CPU to appear as two?
>>>>>>>>
>>>>>>>>Second) If you are running XP, will HT require XP Pro instead of XP Home to take
>>>>>>>>advantage of it?
>>>>>>>>
>>>>>>>>Thanks,
>>>>>>>>
>>>>>>>>Frank
>>>>>>>
>>>>>>>For dual machines you need even newer releases of OSes to still get
>>>>>>>released.
>>>>>>>
>>>>>>>However you can profit from it in a very limited way. It's a speedup of
>>>>>>>18% for DIEP at the latest P4 (3.06Ghz), at older P4s the profit is less
>>>>>>>(like P4 Xeon 2.8Ghz) and even older P4s the profit is zero or negative.
>>>>>>
>>>>>>Any chance you will _ever_ "test before talking"?
>>>>>>
>>>>>>The 2.8 xeon has the _same_ SMT core as the PIV/3.06.  The _same_ means
>>>>>>"the same", not "something that is not as good as."
>>>>>
>>>>>http://www.realworldtech.com/index.cfm  and ask intel designers themselves.
>>>>>
>>>>>>That is simply a crock statement that is nonsense.  From _testing_ on
>>>>>>my part...
>>>>>
>>>>>I see a clear difference in performance. Intel managed to slowly improve SMT
>>>>>to what it is now. I do not find 18% impressive knowing the chip is already
>>>>>that much slower than the K7 for me.
>>>>
>>>>Um, they don't "fix" things that quickly. The Hyperthreading in the Pentium 4
>>>>3.06 GHz is the same Hyperthreading in the Xeon 2.2, 2.4, 2.53, and 2.8 GHz
>>>>chips.
>>>
>>>Look at my questions there posted and the answer from some major experts
>>>there. Many work either for AMD or intel at processor design and they
>>>mention that intel improved SMT slowly to what it is now. And it *does*
>>>speed me up now. That's more than i had expected after they started
>>>drumming with it a few years ago.
>>
>>Yes, but the HT in Xeons is the same HT in the P4 3.06 GHz. The intermediate
>>steps were never released to the public. (Actually the chips probably had HT,
>>but the BIOS vendors were not allowed to enable it.)
>>
>>>>You still seem to be hung up on ipc. The K7 had better ipc than the Pentium
>>>
>>>for me IPC = inter process communication.
>>>
>>>Which ipc do you refer to?
>>
>>As a programmer, IPC = interprocess communication. In the context of CPUs and
>>hardware, IPC = instructions per cycle.
>>
>>>>Pro/2/3. The Pentium 4 is not designed to be efficient with work done per clock.
>>>
>>>Exactly and i find that very sad. However with exception of crafty in
>>>specint, which is a good thing to be in specint, it seems that work done
>>>each processor clock is no big issue. Just L2 cache speed and bandwidth
>>>and some technical details i probably didn't guess so well yet.
>>
>>If you own a car with a 5-gallon tank that gets 50 mpg (miles per gallon) and a
>>car with a 15-gallon tank that gets 25 mpg, which travels further? Efficiency is
>>not necessarily an indicator for raw performance.
>>
>>>>It is designed to ramp to high clock speeds, something the K7 will not do. The
>>>>K7 is nearing the end of its lifetime -- it's running more than 4 times faster
>>>>than its introductory speed.
>>>
>>>I got impression the initial released core was a different one from the
>>>tbird and the athlon MP 1.2 Ghz, later called XP and MP, to be a completely
>>>new recalculated core again.
>>>
>>>probably a lot being the same, but to me it definitely was a different chip.
>>>
>>>Just like the new P4s seem completely different to me from the older P4s
>>>too.
>>
>>Then you are mistaken. The cores have only been tweaked from their original
>>designs. One example -- the Pentium 3 was a Pentium 2 with SSE and clocked to
>>higher speeds. The new P4 is the original P4 with a bigger cache. The designs
>>themselves are identical; a few timings differ, but nothing terribly
>>significant.
>>
>>I have timed sequences of code within 0.1% execution speed between the original
>>Athlon and the AthlonMP (Palomino) chips in my SMP box. When compared to the K6
>>iterations or the P6 core (Pentium 2/3), there is a big difference.
>>
>>>If something has a good name it would be dumb that an improvement to it
>>>is released under a completely different name.
>>
>>They are both released under the name "AMD."
>>
>>If you hadn't noticed, AMD is starting to brand their processors like Intel
>>does. The palomino chips became AthlonXP. Both steppings of Thoroughbred cores
>>are also called AthlonXP. The Clawhammer (K8) is Athlon64.
>>
>>>Also unless they can produce that hammer real cheap, the higher clocked
>>>0.13 K7s will definitely kick butt if they reach 3Ghz too, whereas the
>>>P4 at 0.13 won't go above 3Ghz much.
>>
>>WHAT?! The Pentium 4 is destined for 5 GHz.
>
>You speaking of 0.09 micron i assume and not about 0.13 micron here.
>
>for now i have seen nothing 0.09 micron so i assume that for at least
>a year and some more the 0.13 micron technology will be the standard.

Intel planned to take the P6 core (Pentium Pro/Pentium 2/Pentium 3) to somewhere
past 1 GHz. If I had asked you then, you probably would have said 1 GHz is
ridiculous. Intel plans to take the P7 core (Pentium 4) to around 5 GHz. They
can do this because Intel realizes that, over time, the manufacturing processes
will change, and the smaller dies will allow them to clock higher.

>That means that if k7s get near 3Ghz and the P4s go near the expected
>limit of 3.5ghz for 0.13 technology that the 50% difference in ipc
>(cpu hardware ipc) will give the K7 a major advantage in speed
>for computerchess.

The K7 isn't going to hit 3 GHz. The K7 has hit the end of its lifetime. The K8
will take over (though it is really the same chip with a few tweaks). Either
way, if Intel starts losing, they'll just ramp up the Pentium 4 clock speed
sooner than they had hoped.

Also keep in mind that for a while, Intel was using 0.13 microns while AMD was
still stuck at 0.18.

>>Also, the last K7s will be released within a few months. (Check AMD's processor
>>roadmap.) The Barton core puts the long-awaited 512 KB L2 cache on K7. Guess
>>what? Barton will also be called AthlonXP.
>
>>Clawhammer will be roughly the same price as an AthlonXP chip. This is mostly
>>due to the fact that a K8 is really only a tweaked K7. The K6 and K7 have much
>>less in common than the K7 and K8 (which have nearly identical pipelines).
>
>That document someone posted the url from says the K8-opteron
>has a longer pipeline than the k7. If that's true it means it can
>get very high clocked also in 0.13

The integer pipeline is 12 stages instead of 10. The floating-point pipeline has
not changed. Overall, the pipeline is roughly the same. I haven't seen details
on the new stages, but they probably involve some of the 64-bit changes and
don't assist the processor in clocking higher.

>the k8-opteron gets clocked then easily higher than the K7 i assume
>which would mean it gets to the same speed like the P4 (in 0.13).
>
>Of course the thing intel will manage is to produce 0.09 sooner than
>AMD (history at least indicates that) so we will see how the race
>continues.
>
>In all those processor competition what is real sad is that new released
>cpu's take months more to reach europe.
>
>When i read about a release of a 2.8Ghz Xeon P4 and i phone all computer
>companies here whether they can get one i get a 'no' for months. Idem
>3.06Ghz P4. In the USA somehow these products are sooner available.
>
>I always wonder why. Some processors also get produced in germany!

AMD's main plant is in Dresden.
That's marketing for you...

>>>>>>>So it's progressing but the P4 is a processor not really mature enough:
>>>>>>>too little trace cache and too little datacache: just 1024 quadwords;
>>>>>>
>>>>>>
>>>>>>So?  12K micro-ops.  8kb data.  Core-speed L2 cache with 512KB unified
>>>>>>cache.  Seems to work quite well in all the testing I have done.
>>>>>
>>>>>If it is in theory simply 2 processors then 11% at older types and 18% at
>>>>>new P4 3.06ghz is not much and because of the small L1. Also i didn't
>>>>>figure out yet how big the branch prediction table (BTB) is in the P4
>>>>>but it probably isn't so impressive.
>>>>
>>>>The BTB size doesn't affect that much. It is comparable, but I never concerned
>>>>myself with useless details.
>>>
>>>For many chessprograms the BTB up until like 2000 entries is very
>>>important. Get the best compiler you can get for crafty and compare its
>>>specint version (so no inline assembly getting used) at a P3 with 512
>>>BTB versus 2048 BPT from K7.
>>>
>>>You'll see a *huge* difference in speed.
>>
>>You'll also see a huge difference in speed if you run good code that avoids
>>branching. I've stated a number of times here that there are many techniques to
>>eliminate branches -- both in high-level code and at the assembly level.
>
>>The BTB size really only yields gains in code that:
>>(1) can't avoid branching
>
>==> computerchess is just about making choices in fact. the other
>    instructions are 'extra' heuristics to help you to decide which
>    branch.

I don't think you understood what I meant. I meant an inability to prevent a
control flow change. I did not mean an if statement, a loop, a function call,
etc. I meant an unavoidable mispredict from static prediction.

In many cases, as I have repeatedly said, branches can be eliminated in
computation.
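
A minimal sketch of what eliminating a branch in computation can look like in C. The function names and the bonus example are hypothetical and only illustrate the mask technique; whether it beats the branchy form depends on the compiler and the CPU.

```c
#include <stdint.h>

/* Branchy version: the compiler may emit a conditional jump here. */
int32_t max_branchy(int32_t a, int32_t b)
{
    if (a > b)
        return a;
    return b;
}

/* Branchless version: turn the comparison into an all-ones or all-zeros
   mask and use it to select one of the two values. */
int32_t max_branchless(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a > b);   /* -1 if a > b, else 0 */
    return (a & mask) | (b & ~mask);
}

/* Same idea for a conditional evaluation term: add the bonus only when
   the condition holds, without an if statement. */
int32_t add_bonus_if(int32_t score, int32_t bonus, int condition)
{
    int32_t mask = -(int32_t)(condition != 0);
    return score + (bonus & mask);
}
```

Both max functions return the same result for every input; only the control flow differs.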

>>(2) uses indirect/conditional branches fairly often
>>(3) branches in a predictable manner
>
>==> chessprograms with big evals have many 'seldom' patterns which can
>    cause a big speedup when used in combination with a compiler that
>    can predict them and reorder the branching code (in assembly that's
>    all a lot simpler).
>
>    This is a thing i really feel a compiler should give as an extra
>    to persons writing in C. A compiler directive somehow that this
>    branch most likely is not going to get taken and should get mispredicted
>    by default.
>
>    The compiler that allows this i would kiss its ass so to speak. Gives
>    me 20% speedup just like that. It is one of the main things which
>    makes assembly faster than C.
>
>    The branch reorder passes, most people forget it here, take very long
>    to compile. Up to 30 minutes is no exception.

Unfortunately the only way (right now) to give the compiler a good idea about
branch predictions is to use profiling.
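
As a hedged aside to the profiling route: GCC-compatible compilers document a `__builtin_expect` builtin that lets the programmer mark a condition as rarely true. The sketch below assumes such a compiler; the `UNLIKELY` macro, the function name, and the bonus value are hypothetical, and the hint changes code layout only, not the result.

```c
/* GCC/Clang-specific hint; other compilers fall back to the plain condition. */
#if defined(__GNUC__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

/* Hypothetical evaluation term: the rare pattern is marked as unlikely so
   the compiler can keep the common path as the fall-through case. */
int eval_with_rare_pattern(int material, int rare_pattern_present)
{
    int score = material;
    if (UNLIKELY(rare_pattern_present)) {
        score += 50;   /* hypothetical bonus for the seldom-seen pattern */
    }
    return score;
}
```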

>>>>The small L1 is not the reason why the Pentium 4 with Hyperthreading only gets
>>>>11% or 18% when performance fairy decides to crank it up a notch. (Look for the
>>>>performance fairy enable option in your BIOS.) The Pentium 4 has 512 KB of L2
>>>>cache -- more than any variant of the K7 has in total. L2 is not as fast as L1,
>>>>but it doesn't make a huge difference because it's a lot faster than main
>>>>memory.
>>>
>>>For nearly all specint programs which focus upon that cache of course it
>>>matters a lot that 256 to 512.
>>
>>Ok, but since when do chess programs focus on the cache?
>
>the IPC is dominant for chessprograms. there is just 1 chessprogram in
>the specint by the way. There is a shitload of other programs there that
>is just busy with cache.

I'm going to say "no" and leave it at that.

>>>If you have a very weak spot in a processor and then have a bigger L2 cache
>>>to catch up a bit of the problems, then that obviously works.
>>>
>>>>The real reason why the data changes from 11-18% across processors is because
>>>>you aren't accurately benchmarking. You can't run one test and call it
>>>>conclusive.
>>>
>>>You do not seem to understand a few things.
>>>
>>>Crafty splits at random and when a processor has to abort itself
>>>it splits near the leafs more. Splitting can be very expensive and
>>>entire root is kept locked while doing so.
>>>
>>>That creates a very unstable form of parallelism.
>>>
>>>DIEP is more stable in this sense:
>>>  - splitting more near the root
>>>  - hardly overhead for splitting and doing a minimum
>>>    of locking.
>>>
>>>A few tests which all were within a few tenths of percents definitely
>>>mean *something* then.
>>
>>Yours weren't -- 11% to 18%. Also, I don't believe I mentioned Crafty or Diep at
>>all.
>
>The tests were showing a clear direction. We can discuss in length
>whether it is 11% or 11.3 or 11.4 or perhaps internet connection caused it
>to be 11.7.
>
>This is not important.
>
>The important thing for me is the overall picture and that's that SMT
>speeds up something nowadays and it isn't much.

Not if you're expecting 100%. 10-20% is just fine when it doesn't cost a dime
more. Personally I think I will take HT over no HT.

>>>I dare to mention that my tests are 100 times more objective than the
>>>specint numbers you can see from several programs at the different
>>>spec tests as those numbers are generated after the compiler teams
>>>of the different manufacturers.
>>>
>>>You can stand on your head and claim anything but those numbers are
>>>very representative which i measured and inline with other results done
>>>by many testers!
>>
>>Which are conveniently documented nowhere, not run multiple times, and not
>>reported with detailed configuration information. That really doesn't matter,
>>though. I've gotten less variance polling political opinions.
>
>I am not a scientist who is there in life to waste his time onto
>documenting things.
>
>I document my parallel speedups at the supercomputer very well. You can
>blame me if i document them wrong. You can demand anything there and blame
>proofreaders if they do something wrong.
>
>I hope you realize that i would find it disgusting if you doubt
>that the speedup for me is actually 50% or 0% instead of the < 19%
>diep was tested to.
>
>>>The important thing with these tests is not whether i measure 18% or 11%
>>>or 22% or 17%.
>>>
>>>The importance is that it's not near a 2 times speedup. The importance
>>>for me is that it is *trivial* that even the cheap outdated K7 which
>>>soon is clocked to 3Ghz too, that this is toasting the P4 alive for
>>>DIEP. Even when i use the SMT.
>>
>>Who cares? HT isn't SMP or it wouldn't be evolutionary. HT costs pennies more to
>>produce and sports a nice performance/price ratio.
>
>>Also FYI the K7 will never hit 3 GHz. Even if it did, the P4 would be at 4 GHz
>>by the time the K7 hit 3 GHz. Comparing clock-to-clock between P4 and Athlon is
>>useless and ridiculous.
>
>Not at 0.13 and it isn't ridiculous.

It is ridiculous because they have never competed at the same clockrate. The P4
will scale much higher in clock frequency than the Athlon. The correct
comparison is high-end part to high-end part.

If the Athlon hits 3 GHz and the Pentium 4 hits 6 GHz, which will be the faster
processor? Athlon will still have higher IPC, but it's not going to be faster
than the Pentium 4.
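
To put numbers on that, here is a rough sketch that treats throughput as IPC times clock. The IPC values are assumptions taken from the 50% figure quoted above (1.5 for the Athlon relative to 1.0 for the Pentium 4), not measurements, and the function names are made up for illustration.

```c
/* Relative throughput: instructions per cycle times clock rate in GHz.
   The IPC figures passed in are illustrative assumptions, not measurements. */
double relative_throughput(double ipc, double clock_ghz)
{
    return ipc * clock_ghz;
}

/* Clock rate (GHz) a chip with ipc_b needs in order to match a chip with
   ipc_a running at clock_a_ghz. */
double break_even_clock(double ipc_a, double clock_a_ghz, double ipc_b)
{
    return (ipc_a * clock_a_ghz) / ipc_b;
}
```

Under those assumptions the Athlon at 3 GHz scores 1.5 * 3 = 4.5 and the Pentium 4 at 6 GHz scores 1.0 * 6 = 6.0, so the Pentium 4 would only need 4.5 GHz to draw level.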

>>>We didn't talk about speedup yet. Let's be clear here. Bob didn't
>>>start about it yet.
>>>
>>>But please keep in mind next.
>>>
>>>Bob reports something like 30% speedup?
>>>
>>>Even his own formula (which i disagree with because it is
>>>impossible to have a 0.n speedup for each additional processor
>>>it gets less and tests from GCP at 30 positions indicated 2.8
>>>speedup out of 4 processors whereas the constant formula gives
>>>3.1) which gives speedup = 1 + 0.7(n-1) means the effective
>>>speedup of crafty because of SMT is a lot less.
>>>
>>>Did you realize that?
>>
>>Actually overheads are approximately linear. Not entirely, but approximately.
>
>Wrong.
>
>>>To do the math for his dual Xeon running 2 threads how many processors do
>>>we have netto?
>>>  2 n  ==> (1.7 / 2)  * 2 = 1.7 speedup to 1 thread
>>>
>>>Now if he has 4 processors getting 30% more nodes a second in total
>>>than 2 processors do that gives
>>>  2 * 1.3 = 2.6 ==> (3.1 / 4 ) * 2.6 n = 2.015 speedup
>>>
>>>effective win by SMT in the Hyatt article from a few years ago:
>>>  2.015 / 1.7 = 18.5%
>>
>>I see a lot of numbers, but I really don't follow that logic.
>
>>>If we do the same math for GCP measured numbers at positional testsets
>>>then it is at P3 xeon (and at faster processors it is even worse for
>>>the 4 processors compared to 2 processors
>>>of course as memory plays a bigger role there then):
>>>   2 threads: 1.6
>>>   4 threads: 2.8/4 * 2.6 = 1.82
>>>effective win by SMT ==> 14%
>>>
>>>In current DIEP i get 1.9 speedup at 2 processors
>>>and 3.6 speedup at 4 processors
>>>
>>>That's of course a way better speedup than crafty, but do the
>>>same math for SMT like i did a few months ago which caused me to
>>>get very unhappy about SMT.
>>
>>You were unhappy because you were expecting a massive performance boost. It's
>>not about leaps and bounds. It's about cheap design changes that pull another 5%
>>here and 10% there.
>
>The drumming onto details from big manufacturers i find very sick
>always. If they have a new toy which makes their product less worse
>for me then that's cool. If they drum it is going to 'revolutionize'
>that particular cpu then i disagree because it's still slower for me
>than the alternative.

Every manufacturer does that. AMD does that with Hypertransport. I don't recall
Intel claiming HT would revolutionize microprocessors, but certainly it has
achieved some degree of success.

>>>>>>>compare with the 64KB L1 data cache of a K7 which is i guess 16384
>>>>>>>doublewords.
>>>>>>
>>>>>>
>>>>>>what is with all the quadword/doubleword nonsense?
>>>>>
>>>>>>I think _most_ here can figure out what 64 KB turns into in your favorite
>>>>>>data size...
>>>>>
>>>>>64KB of K7 and just 1024 words of P4.
>>>>>
>>>>>The P4 is using 64 bits addressing for the L1 that means just 1024 words.
>>>>>I prefer personally 16384 words of 32 bits.
>>>>>
>>>>>However the P4 doesn't deliver 2048 words of 32 bits. It delivers 1024
>>>>>words of 64 bits.
>>>
>>>>Um, what? Xeon has used 36 bits for L1 and L2 address tags since the days of the
>>>>Pentium Pro because of the PAE/PSE36 addressing extensions. The chips run on a
>>>>36-bit address bus, not a 64-bit address bus.
>>>
>>>This info i had received by email long before the P4 was introduced and
>>>before they released design documents about it. A lot of data i have
>>>here is outdated
>>>usually when the processor then gets introduced. That is sad such a big
>>>disinformation caused by everyone who wants to publish about it. I would want
>>>to look this up though. 36 bits looks very silly to get from P4 processors L1
>>>or L2 cache as i work with 32 bits variables and when talking about node
>>>count it is 64 bits, but definitely never 36 bits.
>>
>>Read about PAE. Dr. Dobbs Journal has a nice article about it which has been up
>>for a few years. It's been on every processor since the Pentium, but only Xeon
>>supports it. The other mechanism Intel uses is PSE36. Applications still only
>>get 4 GB of memory, but 36-bit addressing means the system can hold 64 GB, so
>>16 programs can each eat 4 GB of memory. In the case of Windows, with tricks
>>similar to old EMS memory you can see more than 4 GB of memory.
>
>NT kernel can see 3 GB for each process AFAIK and by default it is set to 2 GB.
>
>>-Matt

The NT kernel has a 1 GB footprint for Server or Advanced Server (I am not sure
where that begins). All workstation versions have a 2 GB footprint. Note that
this is not actual ram consumed -- hardware buffers (including the massive 128
MB on my video card) are mapped into every process's address space.

Using AWE (address windowing extensions), a process can allocate physical memory
without mapping it into linear (visible) memory. An application can then
allocate memory up to the limit of the machine. It is annoying to use, as any
addressing extension system is, because it requires programming tricks to flip
pages. That is really the best argument in favor of a 64-bit machine right now
-- memory addresses can be wider than 32 bits and thus those paging games
don't need to be played.
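
For anyone who has not played those paging games, here is a portable toy model of the idea. It is not the real AWE API (on Windows the actual calls are AllocateUserPhysicalPages and MapUserPhysicalPages); the page count, names, and single-slot window are all assumptions made for the sketch. The program owns more pages than it can see at once and has to flip the one visible window to reach each of them.

```c
#include <stddef.h>

#define PAGE_SIZE 4096u
#define NUM_PAGES 8u    /* pages the program owns but cannot see all at once */

/* Stands in for physical memory outside the visible address range. */
static unsigned char physical[NUM_PAGES][PAGE_SIZE];

/* The single visible slot; NULL until a page has been mapped. */
static unsigned char *window = NULL;

/* Make page `index` visible through the window. Returns 0 on success,
   -1 if the index is out of range (the old mapping is kept). */
int map_window(size_t index)
{
    if (index >= NUM_PAGES)
        return -1;
    window = physical[index];
    return 0;
}

/* Store one byte through the window. Returns 0 on success, -1 on error. */
int window_store(size_t offset, unsigned char value)
{
    if (window == NULL || offset >= PAGE_SIZE)
        return -1;
    window[offset] = value;
    return 0;
}

/* Load one byte through the window. Returns the byte, or -1 on error. */
int window_load(size_t offset)
{
    if (window == NULL || offset >= PAGE_SIZE)
        return -1;
    return window[offset];
}
```

Every access to a page that is not currently mapped costs an extra map_window call, which is exactly the bookkeeping a flat 64-bit address space makes unnecessary.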

This is not "AFAIK," this comes after much reading about the NT kernel,
developing on the NT kernel, and even looking at NT kernel source.

-Matt



Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.