Computer Chess Club Archives

Subject: Re: Hyper Threading and Chess

Author: Robert Hyatt
Date: 22:39:24 12/31/02
On December 31, 2002 at 23:08:09, Vincent Diepeveen wrote:

>On December 31, 2002 at 13:57:13, Matt Taylor wrote:
>
>>On December 31, 2002 at 11:49:31, Vincent Diepeveen wrote:
>>
>>>On December 30, 2002 at 22:32:52, Robert Hyatt wrote:
>>>
>>>>On December 30, 2002 at 20:29:11, Vincent Diepeveen wrote:
>>>>
>>>>>On December 30, 2002 at 19:39:23, Frank Koenig wrote:
>>>>>
>>>>>>Two questions.
>>>>>>
>>>>>>One) Will Intel's HT technology be able to help chess programs above and beyond
>>>>>>just allowing one CPU to appear as two?
>>>>>>
>>>>>>Second) If you are running XP, will HT require XP Pro instead of XP Home to take
>>>>>>advantage of it?
>>>>>>
>>>>>>Thanks,
>>>>>>
>>>>>>Frank
>>>>>
>>>>>For dual machines you need even newer releases of OSes to still get
>>>>>released.
>>>>>
>>>>>However you can profit from it in a very limited way. It's a speedup of
>>>>>18% for DIEP at the latest P4 (3.06Ghz), at older P4s the profit is less
>>>>>(like P4 Xeon 2.8Ghz) and even older P4s the profit is zero or negative.
>>>>
>>>>Any chance you will _ever_ "test before talking"?
>>>>
>>>>The 2.8 xeon has the _same_ SMT core as the PIV/3.06.  The _same_ means
>>>>"the same", not "something that is not as good as."
>>>
>>>http://www.realworldtech.com/index.cfm  and ask intel designers themselves.
>>>
>>>>That is simply a crock statement that is nonsense.  From _testing_ on
>>>>my part...
>>>
>>>I see a clear difference in performance. Intel managed to slowly improve SMT
>>>to what it is now. I do not find 18% impressive knowing the chip is already
>>>that much slower than the K7 for me.
>>
>>Um, they don't "fix" things that quickly. The Hyperthreading in the Pentium 4
>>3.06 GHz is the same Hyperthreading in the Xeon 2.2, 2.4, 2.53, and 2.8 GHz
>>chips.
>
>Look at my questions there posted and the answer from some major experts
>there. Many work either for AMD or intel at processor design and they
>mention that intel improved SMT slowly to what it is now. And it *does*
>speed me up now. That's more than i had expected after they started
>drumming with it a few years ago.
>
>>You still seem to be hung up on ipc. The K7 had better ipc than the Pentium
>
>for me IPC = inter process communication.
>
>Which ipc do you refer to?

Instructions per Clock (cycle).  Just as he has consistently used it every
time he mentions IPC.

>
>>Pro/2/3. The Pentium 4 is not designed to be efficient with work done per clock.
>
>Exactly and i find that very sad. However with exception of crafty in
>specint, which is a good thing to be in specint, it seems that work done
>each processor clock is no big issue. Just L2 cache speed and bandwidth
>and some technical details i probably didn't guess so well yet.

There is _much_ "you didn't guess so well yet" IMHO.



>
>>It is designed to ramp to high clock speeds, something the K7 will not do. The
>>K7 is nearing the end of its lifetime -- it's running more than 4 times faster
>>than its introductory speed.
>
>I got impression the initial released core was a different one from the
>tbird and the athlon MP 1.2 Ghz, later called XP and MP, to be a completely
>new recalculated core again.


"I got the impression" is a resounding "proof" that it was done...



>
>probably a lot being the same, but to me it definitely was a different chip.
>
>Just like the new P4s seem completely different to me from the older P4s
>too.
>
>If something has a good name it would be dumb that an improvement to it
>is released under a completely different name.


Utter nonsense.  If they improve _anything_ it will become the pentium V,
and then the pentium VI.  _anything_ to change the name to a higher version,
to capture all those "gotta have the latest" folks.



>
>Also unless they can produce that hammer real cheap, the higher clocked
>0.13 K7s will definitely kick butt if they reach 3Ghz too, whereas the
>P4 at 0.13 won't go above 3Ghz much.
>
>>>>
>>>>>
>>>>>So it's progressing but the P4 is a processor not really mature enough:
>>>>>too little trace cache and too little datacache: just 1024 quadwords;
>>>>
>>>>
>>>>So?  12K micro-ops.  8kb data.  Core-speed L2 cache with 512KB unified
>>>>cache.  Seems to work quite well in all the testing I have done.
>>>
>>>If it is in theory simply 2 processors then 11% at older types and 18% at
>>>new P4 3.06ghz is not much and because of the small L1. Also i didn't
>>>figure out yet how big the branch prediction table (BTB) is in the P4
>>>but it probably isn't so impressive.
>>
>>The BTB size doesn't affect that much. It is comparable, but I never concerned
>>myself with useless details.
>
>For many chessprograms the BTB up until like 2000 entries is very
>important. Get the best compiler you can get for crafty and compare its
>specint version (so no inline assembly getting used) on a P3 with a 512-entry
>BTB versus the 2048-entry BTB of the K7.
>
>You'll see a *huge* difference in speed.
>

No you won't.  I don't think I have 2000 unique branch targets in the main
code of Crafty.  Note that is _not_ 2000 branches, but 2000 branch _targets_.



>>The small L1 is not the reason why the Pentium 4 with Hyperthreading only gets
>>11% or 18% when performance fairy decides to crank it up a notch. (Look for the
>>performance fairy enable option in your BIOS.) The Pentium 4 has 512 KB of L2
>>cache -- more than any variant of the K7 has in total. L2 is not as fast as L1,
>>but it doesn't make a huge difference because it's a lot faster than main
>>memory.
>
>For nearly all specint programs which focus upon that cache, of course
>going from 256 to 512 matters a lot.
>
>If you have a very weak spot in a processor and then have a bigger L2 cache
>to catch up a bit of the problems, then that obviously works.

That is simply hand-waving nonsense.  Memory is a bottleneck.  Cache is a
solution to address that bottleneck.  It doesn't "make up for a weak spot in
a processor".  It addresses a huge speed differential between processor and
memory.



>
>>The real reason why the data changes from 11-18% across processors is because
>>you aren't accurately benchmarking. You can't run one test and call it
>>conclusive.
>
>You do not seem to understand a few things.
>
>Crafty splits at random and when a processor has to abort itself
>it splits near the leafs more. Splitting can be very expensive and
>entire root is kept locked while doing so.

That is simply nonsense.  I have given you data to show your "conclusions"
are wrong.  But until you understand my code, you won't get this.  But with
four processors the "smp-lock" is _not_ an issue.  and it is trivial to "proof"
it...




>
>That creates a very unstable form of parallelism.
>
>DIEP is more stable in this sense:
>  - splitting more near the root
>  - hardly overhead for splitting and doing a minimum
>    of locking.
>
>A few tests which all were within a few tenths of a percent definitely
>mean *something* then.

Yes.  Conservative parallelism, _not_ "high performance parallel search."

>
>I dare to mention that my tests are 100 times more objective than the
>specint numbers you can see from several programs at the different
>spec tests as those numbers are generated after the compiler teams
>of the different manufacturers.
>
>You can stand on your head and claim anything but those numbers are
>very representative which i measured and inline with other results done
>by many testers!
>
>The important thing with these tests is not whether i measure 18% or 11%
>or 22% or 17%.
>
>The importance is that it's not near a 2 times speedup. The importance
>for me is that it is *trivial* that even the cheap outdated K7 which
>soon is clocked to 3Ghz too, that this is toasting the P4 alive for
>DIEP. Even when i use the SMT.
>
>We didn't talk about speedup yet. Let's be clear here. Bob didn't
>start about it yet.
>
>But please keep the following in mind.
>
>Bob reports something like 30% speedup?

20% on one test, 30% on another, in terms of raw NPS, _yes_.
Raw NPS is the right way to evaluate hyper-threading for a general
sense of how it will do.  Parallel speedup is another issue altogether
and is _only_ applicable to computer chess, rather than to the general
question of "does SMT work for general applications?"

>
>Even his own formula (which i disagree with because it is
>impossible to have a 0.n speedup for each additional processor
>it gets less and tests from GCP at 30 positions indicated 2.8
>speedup out of 4 processors whereas the constant formula gives
>3.1) which gives speedup = 1 + 0.7(n-1) means the effective
>speedup of crafty because of SMT is a lot less.


I have never seen anyone so stubborn nor so ignorant.  As I have
said before, my "formula" is a "general estimate".  A real raw speedup
does _not_ apply to a general case.  It applies either to a specific test
position or a specific set of test positions.  It can vary from one set to
another.  And it can vary within a single test set.  I didn't make my numbers
up, and I have posted the raw data here _many_ times...

Unlike yourself, where you wave your hands, estimate this, talk about "If I
disable unsafe extensions" and other such nonsense and then you quote numbers
that are meaningless.
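
For reference, the general estimate discussed above, speedup = 1 + 0.7(n-1), can be written out in a few lines of Python. This is only a sketch of the rough estimate; the function name and the per_cpu_gain parameter are made up for illustration, and the result is not a measurement for any particular position or test set.

```python
def estimated_speedup(n_cpus, per_cpu_gain=0.7):
    """Rough general estimate from the thread: 1 + 0.7 * (n - 1).

    per_cpu_gain is a hypothetical parameter name; 0.7 is the constant
    quoted in the discussion. Real speedups vary by position and test set.
    """
    if n_cpus < 1:
        raise ValueError("need at least one processor")
    return 1.0 + per_cpu_gain * (n_cpus - 1)


if __name__ == "__main__":
    # Print the estimate for the processor counts mentioned in the thread.
    for n in (1, 2, 4):
        print(f"{n} cpus: estimated speedup {estimated_speedup(n):.2f}")
```

For 2 and 4 processors this gives the 1.7 and 3.1 figures that the quoted text works with.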



>
>Did you realize that?
>
>To do the math for his dual Xeon running 2 threads: how many processors do
>we have net?
>  2 n  ==> (1.7 / 2)  * 2 = 1.7 speedup to 1 thread
>
>Now if he has 4 processors getting 30% more nodes a second in total
>than 2 processors do that gives
>  2 * 1.3 = 2.6 ==> (3.1 / 4 ) * 2.6 n = 2.015 speedup
>
>effective win by SMT in the Hyatt article from a few years ago:
>  2.015 / 1.7 = 18.5%
>
>If we do the same math for GCP measured numbers at positional testsets
>then it is at P3 xeon (and at faster processors it is even worse for
>the 4 processors compared to 2 processors
>of course as memory plays a bigger role there then):
>   2 threads: 1.6
>   4 threads: 2.8/4 * 2.6 = 1.82
>effective win by SMT ==> 14%
>
>In current DIEP i get 1.9 speedup at 2 processors
>and 3.6 speedup at 4 processors
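
The arithmetic in the quoted text above can be checked mechanically. The sketch below only reproduces that calculation as quoted; it does not endorse the model behind it. It assumes, as the quoted text does, that a dual with SMT behaves like 2 * 1.3 = 2.6 "effective" processors running at the per-processor efficiency of a 4-processor search. The function and parameter names are invented for illustration.

```python
def effective_smt_gain(speedup_2, speedup_4, nps_gain=1.3):
    """Reproduce the quoted SMT arithmetic.

    speedup_2: parallel speedup with 2 threads on 2 real processors
    speedup_4: parallel speedup with 4 real processors
    nps_gain:  raw NPS ratio of 4 SMT threads over 2 threads (1.3 = 30% more)
    Returns (speedup with SMT on, relative gain over 2 threads).
    """
    effective_cpus = 2 * nps_gain                 # 2.6 in the quoted text
    smt_speedup = (speedup_4 / 4) * effective_cpus
    return smt_speedup, smt_speedup / speedup_2 - 1.0


if __name__ == "__main__":
    # Case 1: the constant formula (1.7 at 2 cpus, 3.1 at 4 cpus).
    s, g = effective_smt_gain(1.7, 3.1)
    print(f"formula: speedup {s:.3f}, gain {g * 100:.1f}%")
    # Case 2: the GCP numbers quoted above (1.6 at 2 cpus, 2.8 at 4 cpus).
    s, g = effective_smt_gain(1.6, 2.8)
    print(f"GCP:     speedup {s:.3f}, gain {g * 100:.1f}%")
```

With the quoted inputs this yields a 2.015 speedup (about 18.5%) for the formula case and 1.82 (about 14%) for the GCP case, matching the figures in the quoted text.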

Produce some _real_ data.  I don't believe your made-up hand-waving nonsense
for a minute.  Send me an executable and let me run the tests and post the
raw data.  It is easy to make claims when no one can test to show your stuff
is nonsense.

However, for fun, here is a set of four positions that are _worst_ case for
me in terms of parallel performance, overall.  I ran them with one processor
and with four (two real cpus with SMT on).

145.5 seconds with one cpu
80.23 seconds for 4 threads

real speedup = 1.81 on a dual xeon with SMT on.  I consider that reasonable,
and I consider it _verifiable_.  I can pick 4 positions that will produce a
speedup of 2.5 with 4 SMT threads if you want.  As I said, the above is a
known set of 4 bad positions, each run 4 times...
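
The speedup figure follows directly from the two timings above. A minimal sketch, with helper names chosen here for illustration only:

```python
def speedup(time_one_cpu, time_parallel):
    """Real (wall-clock) speedup: single-cpu time divided by parallel time."""
    if time_one_cpu <= 0 or time_parallel <= 0:
        raise ValueError("times must be positive")
    return time_one_cpu / time_parallel


def efficiency(time_one_cpu, time_parallel, n_threads):
    """Speedup per thread; 1.0 would mean perfect scaling."""
    return speedup(time_one_cpu, time_parallel) / n_threads


if __name__ == "__main__":
    t1, t4 = 145.5, 80.23  # seconds, from the timings above
    print(f"speedup:    {speedup(t1, t4):.2f}")
    print(f"efficiency: {efficiency(t1, t4, 4):.2f} per SMT thread")
```

Note that the four threads here run on two real cpus with SMT on, so efficiency per thread is not comparable to a machine with four physical processors.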

>
>That's of course a way better speedup than crafty, but do the
>same math for SMT like i did a few months ago which caused me to
>get very unhappy about SMT.

Be as unhappy as you want, _it works_.  I didn't expect 2.0X for a
single cpu with SMT on, nor did I expect nothing as you claimed when
you started your nonsense.  Then you claimed that the processors didn't
exist.  Then you claimed my xeons didn't have SMT.  Then you claimed that
only the 3.06 had SMT.  And _each_ time you were proven wrong.  As usual.

>
>>>>>compare with the 64KB L1 data cache of a K7 which is i guess 16384
>>>>>doublewords.
>>>>
>>>>
>>>>what is with all the quadword/doubleword nonsense?
>>>>
>>>>I think _most_ here can figure out what 64 KB turns into in your favorite
>>>>data size...
>>>
>>>64KB of K7 and just 1024 words of P4.
>>>
>>>The P4 is using 64 bits addressing for the L1 that means just 1024 words.
>>>I prefer personally 16384 words of 32 bits.
>>>
>>>However the P4 doesn't deliver 2048 words of 32 bits. It delivers 1024
>>>words of 64 bits.
>
>>Um, what? Xeon has used 36 bits for L1 and L2 address tags since the days of the
>>Pentium Pro because of the PAE/PSE36 addressing extensions. The chips run on a
>>36-bit address bus, not a 64-bit address bus.
>
>This info i had received by email long before the P4 was introduced and
>before they released design documents about it. A lot of the data i have
>here is usually outdated by the time the processor gets introduced. It is sad
>that so much disinformation is caused by everyone who wants to publish about
>it. I would want to look this up though. 36 bits looks very silly to get from
>the P4's L1 or L2 cache, as i work with 32 bit variables, and when talking
>about node count it is 64 bits, but definitely never 36 bits.

You can't even read.  He is talking about _addressing_.  You said addressing
but you meant fetching, not addressing.  Until you can talk with the same
vocabulary everyone else uses, conversation is impossible.

The xeons have a 36 bit address _space_, i.e. 64 gigabytes of RAM can be
addressed, and tagged within the cache.  The processors fetch 64 bits
from cache to the processor.  _all_ processors (intel compatible) are
using a 64 bit memory bus at the moment.
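
The difference between address width and fetch width is easy to show with the numbers from this thread. A small sketch; the helper names are invented here, and the cache sizes (8 KB P4 L1 data, 64 KB K7 L1 data) are the ones quoted in the discussion:

```python
GIB = 2 ** 30  # bytes in one gigabyte (binary)


def addressable_bytes(address_bits):
    """Bytes reachable with a given address width."""
    return 2 ** address_bits


def words_in_cache(cache_bytes, word_bits):
    """How many words of a given width fit in a cache of a given size."""
    return (cache_bytes * 8) // word_bits


if __name__ == "__main__":
    # Addressing: 36 address bits reach 64 gigabytes of RAM.
    print(addressable_bytes(36) // GIB, "GB addressable with 36 bits")
    # Capacity: the same 8 KB cache counted in different word sizes.
    print(words_in_cache(8 * 1024, 64), "quadwords (64 bit) in 8 KB")
    print(words_in_cache(8 * 1024, 32), "doublewords (32 bit) in 8 KB")
    print(words_in_cache(64 * 1024, 32), "doublewords (32 bit) in 64 KB")
```

The cache holds the same number of bytes whichever word size is used to count it; only the unit changes.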

>
>>The cache is the cache, and the Pentium 4 and K7 caches are equally capable of
>>delivering bytes, two bytes, four bytes, eight bytes, or sixteen bytes on
>>respectively aligned address boundaries. The K7 has a line size of 64 bytes (not
>>bits), and the Pentium 4 has a line size of 128 bytes. Eugene clarified some
>>confusion about the Pentium 4 line size, but that is completely irrelevant here.
>>
>>-Matt
Last modified: Thu, 15 Apr 21 08:11:13 -0700