Computer Chess Club Archives

Subject: Re: 3.06 Xeon Test Results

Author: Robert Hyatt
Date: 08:31:34 04/10/03

On April 10, 2003 at 09:53:04, Vincent Diepeveen wrote:

>On April 10, 2003 at 00:39:24, Anthony Cozzie wrote:
>
>>On April 09, 2003 at 23:00:58, Robert Hyatt wrote:
>>
>>>On April 09, 2003 at 21:46:19, Anthony Cozzie wrote:
>>>
>>>>Crafty seems to be unique in that it gets a lot from hyperthreading - Fritz does
>>>>not, as Charles' benchmarks show.
>>>>
>>>>SMT is not a guaranteed win.   For SMT to accelerate a chess program, the
>>>>following inequality must be true:
>>>>
>>>>(1+S)*1.7/2 > 1
>>>>
>>>>=> S >= 17% (approximately: SMP speedup varies a lot over various positions)
>>>>
>>>>In other words, Crafty on a PIV will get 30% from hyperthreading and a positive
>>>>speedup, Fritz on a PIV will get 10% from hyperthreading and a relative slowdown
>>>>in search (even as NPS goes up).
>>>>
>>>>anthony
>>>
>>>
>>>I do _not_ follow a discussion which talks about a slowdown.
>>>
>>>If a program produces _any_ speedup on two cpus, then it should produce a
>>>speedup using less than two cpus, but more than one cpu (SMT in other words).
>>>
>>>If your NPS goes up by 10%, then with a 1.7x multiplier on two real cpus,
>>>the program should run 1.07X faster using SMT.
>>>
>>>If the NPS goes _down_ with SMT on, something else is broken, either in the
>>>software or the hardware.
>>
>>My understanding of SMT is as follows.  The processor divides its resources
>>(issue queue, functional units, cache, etc) among two threads.  Now, I *think*
>>that said threads are equal, that is, that they both get 50% of the CPU.
>>Otherwise, special OS code would have to be written to support SMT.
>>
>>Therefore, suppose fritz on a PIV gets 1000 knps, while with SMT enabled it gets
>>1100 knps.  That would mean that each "virtual cpu" is creating 550 knps.
>>1.7x550 = 935 equivalent -> slower.  So while NPS goes up, time to solution goes
>>up too.  The increased raw speed doesn't compensate for the SMP inefficiencies.
>>Now with *crafty*, the NPS increase is so big that it is worthwhile.
>>
>>anthony
>
>bob is busy with 2 Xeon cpu's that split into 4 processors.
>that makes all math a lot harder for everyone.
>
>Let's do more realistic math. For single cpu P4 3.06Ghz, because that's the only
>P4 with SMT turned on. Only 3.06Ghz or faster P4s have SMT.

I have a PIV.  At 2.8ghz.  It has SMT.  In fact, _both_ processors in my dual
PIV box have SMT.  I have six dual PIV 2.4ghz boxes.  _all_ of them have SMT.
Feel free to drop by and I'll show 'em to you.


>
>your program at the P4 3.06 running single cpu is of course at 1.0 speed of a
>program. It executes it, no problems.
>
>Now we move from a single cpu to running the program as 2 processes.
>
>Then 2 things happen.
>  a) you get a sequential speedup
>  b) you get a slowdown because parallel software doesn't have perfect speedup.
>
>I will show you 2 things.
>  a) with bobs numbers crafty still is slower than when running it single cpu

I'll run a couple of tests.  I can't run single vs dual, unfortunately, because
I can't remove one of these CPUs as the machine won't boot without a terminator
in the empty processor slot and I don't have one.  But I will run some dual SMT
off vs dual SMT on tests and post the times required and the speedup from
SMT=on...


>  b) there is a lot of other factors to take into account too which color the
>picture of bob even worse.
>
>Bob claims he gets a sequential speedup of 30% out of crafty at SMT. I do not
>believe that at all, and world wide crafty is shown to have 20% speedup. But
>let's use Bob's numbers for now.

I have claimed _both_.  On one test set, 30%; on another test set, 20%.  Is that
so hard to remember?  This is not an exact number.  Any more than parallel
speedup is an exact number.  It varies by test position set...  As _always_.


>
>1.0 + 30% = 1.3  right?
>
>Now bob claims crafty has a 1.7 speedup at 2 processes. I do not believe that.
>Even his own testings look more like 1.6, but let's use his own number: 1.7.
>
>(1.7 / 2) * 1.3 = 1.10
>
>So that shows a 10% speedup practical, right?
>
>First problem:
>  a) single cpu versions are much faster than parallel versions. the 1.7
>     speedup is of a multithreading crafty versus the same crafty version with
>     1 thread; a version optimized to not do all that SMP stuff is
>     much faster.

False information.  I ran a SMP vs non-SMP test on my quad and posted the
results.  There is no "slowdown" because of how the code is written.  Two
additional instructions per node searched (non-q-search nodes only) is _all_
the overhead I have for the SMP version when run without multiple threads.

non-q-search nodes represent maybe 10% of the total search.  I don't think I can
even measure the cost of two instructions extra, in 10% of the total nodes.



> We can go into lengthy discussions how much. Everywhere it
>     uses pointers which it simply doesn't need when run single cpu.
>     I claim 15% slowdown here in total. But Bob will probably deny it and
>     say it is no more than 5% after weeks of discussions.

You know your number is wrong.  Because, when I wrote version 15.0, I compared
it to the previous version and the slowdown was in the 5-7% range using gcc.  It
might be better with ICC.  I can't help what _your_ slowdown was, but I _know_
what mine was because, as always, I took the time to measure it to see what it
was costing.

> Let's avoid it,
>     but it is SLOWER simply multithreaded with MT=1 than a special single cpu
>     version
>  b) bob claims 1.7 speedup, but my own tests show 1.4 speedup and others show
>1.6 speedup. the difference is that i compared with a very well compiled crafty
>single cpu version (when the old versions could still compile at other compilers
>than intel c++)

The current versions compile with gcc just fine...


> for the K7, versus using the same compiler for SMT compile.
>
>Bob's 1.7 claim speedup out of 2 processors is based upon very old processors
>out of ancient history and like 3 or 4 careful chosen positions.


Hmm... Didn't _you_ choose the last set of positions we tested?  I've posted
results for Kopec, for part of the BT positions, for all of the Cray Blitz vs
Mchess positions, etc...

"carefully chosen"?  Perhaps carefully chosen by _you_.  Not by me...


>
>This where IMHO in most positions programs search already deep enough and you
>want to be faster and search deeper only in those positions where search has
>problems single cpu. that's usually positions where score goes slowly down and
>where programs hesitate a lot.
>
>At these positions crafty at his own machine was shown to have a 2.8 speedup out
>of 4 processors. Not 3.1. His own tests at unimportant positions showed a 3.0
>speedup. Not even near the 3.1 claim from himself.


You have a _severe_ language and thinking problem.  3.0 is "not even near
3.1?"????

:)

:)

:)

You always want to start this discussion, based on the premise "I can't make my
parallel search work very well, so I will try to knock every other parallel
algorithm to make everyone think they are worse."

Doesn't fly.  I have test sets that will produce speedups from 1.5 to 2.3 for
two processors.  Because the speedup varies by position.  And that's why I
always run a _set_ of positions so that the speedup is not biased in either
direction.  And that's why _you_ always want to choose a few positions where I
do very poorly, without balancing those with positions where I do very well.

That is bias...  and it is dishonest.  I, at least, don't subscribe to it.
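
The point about running a whole _set_ of positions can be illustrated with a
small sketch.  The timings below are invented purely for illustration (they are
not measurements from Crafty or any other program); they are chosen only so the
per-position speedups fall in the 1.5 to 2.3 range mentioned above.  The helper
names are hypothetical.

```python
def per_position_speedups(one_cpu_times, two_cpu_times):
    """Speedup for each position: one-cpu time divided by two-cpu time."""
    return [t1 / t2 for t1, t2 in zip(one_cpu_times, two_cpu_times)]


def aggregate_speedup(one_cpu_times, two_cpu_times):
    """Speedup over the whole set: total one-cpu time / total two-cpu time."""
    return sum(one_cpu_times) / sum(two_cpu_times)


def worst_k_speedup(one_cpu_times, two_cpu_times, k):
    """Aggregate speedup when only the k worst-scaling positions are kept."""
    pairs = sorted(zip(one_cpu_times, two_cpu_times), key=lambda p: p[0] / p[1])
    chosen = pairs[:k]
    return sum(p[0] for p in chosen) / sum(p[1] for p in chosen)


# Hypothetical fixed-depth search times in seconds (made up for illustration).
one_cpu = [100.0, 80.0, 120.0, 60.0]
two_cpu = [66.0, 40.0, 52.0, 40.0]

speedups = per_position_speedups(one_cpu, two_cpu)
print("per position:", [round(s, 2) for s in speedups])
print("whole set   :", round(aggregate_speedup(one_cpu, two_cpu), 2))
print("worst 2 only:", round(worst_k_speedup(one_cpu, two_cpu, 2), 2))
```

With numbers like these, the figure for the two worst positions comes out well
below the figure for the full set, which is the selection bias being described.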




> So he's already indicating
>his own tests are not very good. Apart from that those positions tested are
>simply not relevant for game positions.

Other than the fact that they _came_ from game positions?  :)  The last test we
ran where I produced 3.0 out of 4.0 came from a _real_ game.  Did you forget
that?  Of course not.  You just want to be dishonest...



>
>Get a bunch of positions where crafty made mistakes in world champs last few
>years. And you'll see.
>
>A good example is that blunder against deepthought (endgame), or the blunders
>against Junior (crafty pawn up in wmcc 2001), DIEP, and many others.
>
>Anyway you see how important the math becomes.
>
>Now the SMT speedup. Bob's 30% claim is not based upon number of nodes seen.
>It is based upon either search times or whatever. Never something objective.

It was based on raw NPS.  Nothing more.  Nothing less.  And I _clearly_ stated
that.  The question at the time was does SMT work.  The answer was yes, the
program ran 30% faster (in one test set, 20% in another) in terms of raw NPS.
Makes no sense to factor in parallel search efficiency _there_ because the
question was not "does SMT make a _chess program_ faster?"  It was "does SMT
make a program faster overall or not?"

Not all algorithms have the parallel search overhead that tree searching does...
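
For readers following the arithmetic, here is a minimal sketch of the model
quoted at the top of the thread, (1+S)*1.7/2 > 1, which is what ties a raw NPS
gain S to a search speedup.  It assumes, as that quoted model does, that the two
SMT threads pay the same parallel search overhead as two real cpus; that is an
assumption of the model, not a measured result, and the function names are
hypothetical.

```python
def smt_effective_speedup(nps_gain, two_cpu_speedup=1.7):
    """Quoted model: (1 + S) * speedup / 2, relative to one thread, SMT off."""
    return (1.0 + nps_gain) * two_cpu_speedup / 2.0


def break_even_nps_gain(two_cpu_speedup=1.7):
    """Smallest raw NPS gain S for which the quoted model gives exactly 1.0."""
    return 2.0 / two_cpu_speedup - 1.0


# The two NPS gains discussed in the thread: 30% and 10%.
for label, gain in (("30% NPS gain", 0.30), ("10% NPS gain", 0.10)):
    print(f"{label}: {smt_effective_speedup(gain):.3f}x")

print(f"break-even NPS gain: {break_even_nps_gain():.1%}")
```

Under this model a 30% raw gain lands a little above 1.0 and a 10% raw gain a
little below it.  A program without tree-search overhead would simply keep the
raw gain, which is the distinction being drawn above.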


>
>Show me how many nodes a second you get!
>
>He just doesn't do it. Crafty doesn't print number of nodes a second.

Eh?

It has _always_ printed the number of nodes a second.  And I do mean _always_.
Again, I have no idea what your personal agenda is here, but _anybody_ can look
at a logfile from crafty and see the _precise_ NPS value for each position
searched.


>
>For Shredder SMT gives a slight speedup about 10%. For Fritz it gives about
>10-11% speedup. For DIEP at the 2.8Xeon and below it gives a 10% speedup in
>nodes a second.
>
>For crafty you can measure it yourself, if you somehow can get a correct nodes a
>second shown somewhere. The only way to do that is putting it to a certain
>search depth. Bob never did. He quotes something here.

I _always_ search to a fixed depth when testing.  And I do mean _always_.

You know that...


>
>A test performed here showed a 20% speedup. Test by himself. 20% is a more likely
>number anyway. That is also the claim by intel. That it should give a 20% boost
>in nodes a second. The big secret of crafty is that no one can compile it at all
>parallel unless you have the intel c++ compiler.

False.  Compiles just fine for me using the GCC compiler as well.  Doesn't run
as fast, but it compiles cleanly on RedHat 8.0.  It _always_ has...


> And it doesn't show nodes a
>second. And if you modify it to do it, then he will claim it is incorrectly
>showing the nodes needed for mainlines! How pathetic!

Can you spell "idiot"?

You are saying "it doesn't show nodes a second".  Which is a provable lie.  What
you mean, but apparently can't clearly say, is "it doesn't show nodes searched
when it produces a new PV."

The two are _not_ the same thing.

NPS is _always_ given.



>
>Let's use these numbers to calculate the speedup.
>
>Assumption a) 20% speedup sequential.
>Assumption b) speedup 2.8 at difficult positions out of 4 processors.
>
>2 processors x 1.2 = 2.4 speed in nodes a second.
>2.4 * 2.8 / 4  = 1.68 speedup effectively. Missing over 34% in performance in
>short.
>
>So you see how important the parallel performance of a program is.
>If we use bob's inaccurate numbers it gets:
>
>2 processors x 1.3 = 2.6 speed in nodes a second.
>2.6 * 3.0 / 4 = 1.95
>
>His own math isn't so very good either!

I didn't do any math.  I ran the test.  That's the difference between us.  You
wave hands and make wild statements.  I simply run the tests and report the
results.  You ought to try that one day...
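
The quoted four-thread arithmetic can be reproduced with a short sketch, so the
two cases are easy to compare.  It only restates the quoted calculation (raw NPS
factor for 2 cpus with SMT, scaled by the claimed speedup out of 4 processes);
the function name is hypothetical, and nothing here is a measurement.

```python
def quoted_four_thread_speedup(smt_nps_gain, speedup_out_of_4,
                               physical_cpus=2, logical_cpus=4):
    """Restates the quoted arithmetic, relative to one cpu with SMT off.

    nps_factor : raw NPS of all physical cpus with SMT on
    efficiency : claimed search speedup divided by the number of processes
    """
    nps_factor = physical_cpus * (1.0 + smt_nps_gain)
    efficiency = speedup_out_of_4 / logical_cpus
    return nps_factor * efficiency


# Case 1 from the quoted text: 20% NPS gain, 2.8 speedup out of 4.
case_20 = quoted_four_thread_speedup(0.20, 2.8)
# Case 2 from the quoted text: 30% NPS gain, 3.0 speedup out of 4.
case_30 = quoted_four_thread_speedup(0.30, 3.0)

print(f"20% gain, 2.8/4: {case_20:.2f}x")
print(f"30% gain, 3.0/4: {case_30:.2f}x")
```

Whether either figure counts as "slower" depends on the baseline it is compared
with (for example the disputed 1.7 for two cpus with SMT off), and, as with any
of these numbers, on the test positions used.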


>
>So at his OWN machine. Using his OWN numbers, his machine is slower.
>
>As proven.
>
>But imagine how much better it has to perform to get a positive speedup out of
>it!
Last modified: Thu, 15 Apr 21 08:11:13 -0700