Computer Chess Club Archives

Subject: Re: 3.06 Xeon Test Results

Author: Vincent Diepeveen
Date: 06:53:04 04/10/03
On April 10, 2003 at 00:39:24, Anthony Cozzie wrote:

>On April 09, 2003 at 23:00:58, Robert Hyatt wrote:
>
>>On April 09, 2003 at 21:46:19, Anthony Cozzie wrote:
>>
>>>Crafty seems to be unique in that it gets a lot from hyperthreading - Fritz does
>>>not, as Charles' benchmarks show.
>>>
>>>SMT is not a guaranteed win.   For SMT to accelerate a chess program, the
>>>following inequality must be true:
>>>
>>>(1+S)*1.7/2 > 1
>>>
>>>=> S >= 17% (approximately: SMP speedup varies a lot over various positions)
>>>
>>>In other words, Crafty on a PIV will get 30% from hyperthreading and a positive
>>>speedup, Fritz on a PIV will get 10% from hyperthreading and a relative slowdown
>>>in search (even as NPS go down).
>>>
>>>anthony
>>
>>
>>I do _not_ follow a discussion which talks about a slowdown.
>>
>>If a program produces _any_ speedup on two cpus, then it should produce a
>>speedup using less than two cpus, but more than one cpu (SMT in other words).
>>
>>If your NPS goes up by 10%, then with a 1.7x multiplier on two real cpus,
>>the program should run 1.07X faster using SMT.
>>
>>If the NPS goes _down_ with SMT on, something else is broken, either in the
>>software or the hardware.
>
>My understanding of SMT is as follows.  The processor divides its resources
>(issue queue, functional units, cache, etc) among two threads.  Now, I *think*
>that said threads are equal, that is, that they both get 50% of the CPU.
>Otherwise, special OS code would have to be written to support SMT.
>
>Therefore, suppose fritz on a PIV gets 1000 knps, while with SMT enabled it gets
>1100 knps.  That would mean that each "virtual cpu" is creating 550 knps.
>1.7x550 = 935 equivalent -> slower.  So while NPS goes up, time to solution goes
>down.  The increased raw speed doesn't compensate for the SMP inefficiencies.
>Now with *crafty*, the NPS increase is so big that it is worthwhile.
>
>anthony

Bob is working with 2 Xeon CPUs that split into 4 logical processors.
That makes the math a lot harder for everyone.

Let's do more realistic math, for a single-CPU P4 3.06GHz, because that is the
only P4 with SMT turned on. Only P4s at 3.06GHz or faster have SMT.

Your program running single-CPU on the P4 3.06 is of course at 1.0 speed. It
simply executes, no problems.

Now we move from single CPU to running the program as 2 processes.

Then 2 things happen:
  a) you get a sequential speedup (more nodes a second from SMT)
  b) you get a slowdown because parallel software doesn't have perfect speedup.

I will show you 2 things:
  a) with Bob's numbers Crafty is still slower than when running it single-CPU
  b) there are a lot of other factors to take into account too, which make
     Bob's picture look even worse.

Bob claims he gets a sequential speedup of 30% out of Crafty with SMT. I do not
believe that at all; worldwide, Crafty is shown to get a 20% speedup. But
let's use Bob's numbers for now.

1.0 + 30% = 1.3  right?

Now Bob claims Crafty has a 1.7 speedup with 2 processes. I do not believe that.
Even his own tests look more like 1.6, but let's use his own number: 1.7.

(1.7 / 2) * 1.3 = 1.10

So that shows a practical speedup of 10%, right?
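
The (1.7 / 2) * 1.3 calculation above can be written out as a small sketch.
This is only an illustration of the arithmetic used in this thread, not a
measurement; the function and parameter names are made up for the example, and
it assumes the simple model that both SMT threads get an equal share of the CPU.

```python
def effective_speedup(nps_gain, parallel_speedup, n_threads):
    """Time-to-solution speedup versus one thread on the same physical CPU.

    nps_gain         -- fractional NPS increase from SMT (0.30 means +30%)
    parallel_speedup -- search speedup the program gets on n_threads real CPUs
    n_threads        -- number of search threads sharing the physical CPU
    (All names are hypothetical; this is the thread's simple model only.)
    """
    efficiency = parallel_speedup / n_threads  # e.g. 1.7 / 2 = 0.85
    return (1.0 + nps_gain) * efficiency


def break_even_nps_gain(parallel_speedup, n_threads):
    """NPS gain at which SMT neither helps nor hurts (effective speedup = 1)."""
    return n_threads / parallel_speedup - 1.0


if __name__ == "__main__":
    # Bob's numbers: +30% NPS, 1.7 speedup on 2 processes.
    print(effective_speedup(0.30, 1.7, 2))
    # A program that only gets +10% NPS from SMT, same 1.7 speedup.
    print(effective_speedup(0.10, 1.7, 2))
    # NPS gain needed just to break even at a 1.7 speedup.
    print(break_even_nps_gain(1.7, 2))
```

With a 1.7 speedup the break-even NPS gain comes out at roughly 17-18%, which
matches the inequality quoted above, and a 10% NPS gain lands below 1.0.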

First problem:
  a) Single-CPU versions are much faster than parallel versions. The 1.7
     speedup compares multithreaded Crafty against the same Crafty version
     running 1 thread. A version optimized to not do all that SMP stuff is
     much faster. We can go into lengthy discussions about how much. Everywhere
     it uses pointers which it simply doesn't need when run single-CPU.
     I claim a 15% slowdown here in total. Bob will probably deny it and
     say it is no more than 5% after weeks of discussion. Let's avoid that,
     but the multithreaded build with MT=1 is simply SLOWER than a special
     single-CPU version.
  b) Bob claims a 1.7 speedup, but my own tests show a 1.4 speedup and others
     show a 1.6 speedup. The difference is that I compared against a very well
     compiled single-CPU Crafty for the K7 (from when the old versions could
     still be compiled with compilers other than Intel C++), rather than using
     the same compiler as for the SMT build.

Bob's claim of a 1.7 speedup out of 2 processors is based upon very old
processors out of ancient history and something like 3 or 4 carefully chosen
positions.

IMHO, in most positions programs already search deep enough. You want to be
faster and search deeper only in those positions where the single-CPU search
has problems. Those are usually positions where the score slowly goes down and
where programs hesitate a lot.

At these positions Crafty on his own machine was shown to have a 2.8 speedup out
of 4 processors. Not 3.1. His own tests at unimportant positions showed a 3.0
speedup. Not even near his own 3.1 claim. So he is already indicating that
his own tests are not very good. Apart from that, the positions tested are
simply not relevant for game positions.

Get a bunch of positions where Crafty made mistakes in the world championships
of the last few years, and you'll see.

Good examples are the blunder against Deep Thought (endgame), and the blunders
against Junior (Crafty a pawn up, WMCC 2001), DIEP, and many others.

Anyway you see how important the math becomes.

Now the SMT speedup. Bob's 30% claim is not based upon the number of nodes seen.
It is based upon search times or whatever. Never something objective.

Show me how many nodes a second you get!

He just doesn't do it. Crafty doesn't print the number of nodes a second.

For Shredder SMT gives a slight speedup of about 10%. For Fritz it gives about
a 10-11% speedup. For DIEP on the 2.8 Xeon and below it gives a 10% speedup in
nodes a second.

For Crafty you can measure it yourself, if you can somehow get a correct
nodes-a-second figure shown somewhere. The only way to do that is to set it to
a fixed search depth. Bob never did. He just quotes something here.

A test reported here showed a 20% speedup. A test by himself. 20% is a more
likely number anyway. That is also Intel's claim: that it should give a 20%
boost in nodes a second. The big secret of Crafty is that no one can compile it
parallel at all unless they have the Intel C++ compiler. And it doesn't show
nodes a second. And if you modify it to do so, then he will claim it is
incorrectly showing the nodes needed for mainlines! How pathetic!

Let's use these numbers to calculate the speedup.

Assumption a) 20% sequential speedup.
Assumption b) 2.8 speedup at difficult positions out of 4 processors.

2 processors x 1.2 = 2.4 speed in nodes a second.
2.4 * 2.8 / 4  = 1.68 effective speedup. In short, over 34% of the performance
is missing.

So you see how important the parallel performance of a program is.
If we use Bob's inaccurate numbers we get:

2 processors x 1.3 = 2.6
2.6 * 3.0 / 4 = 1.95

His own math isn't very good either!

So on his OWN machine, using his OWN numbers, his machine is slower.

As proven.

But imagine how much better it has to perform to get a positive speedup out of
it!
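
The two 4-processor calculations above (1.68 and 1.95) follow the same pattern
and can be reproduced with a short sketch. The function name and parameters are
invented for illustration, and it assumes each physical CPU exposes exactly 2
logical processors; it only repeats the arithmetic from this post.

```python
def smt_box_speedup(physical_cpus, nps_gain, speedup_all_threads):
    """Effective speedup of an SMT machine versus one single-threaded CPU.

    physical_cpus       -- number of real CPUs (2 for the dual Xeon here)
    nps_gain            -- fractional NPS gain per CPU from SMT (0.20 = +20%)
    speedup_all_threads -- search speedup measured with all logical processors
                           (2 per physical CPU is an assumption of this sketch)
    """
    logical_cpus = 2 * physical_cpus
    raw_nps = physical_cpus * (1.0 + nps_gain)      # e.g. 2 x 1.2 = 2.4
    parallel_efficiency = speedup_all_threads / logical_cpus
    return raw_nps * parallel_efficiency


if __name__ == "__main__":
    # 20% NPS gain, 2.8 speedup out of 4 processors.
    print(smt_box_speedup(2, 0.20, 2.8))
    # Bob's numbers: 30% NPS gain, 3.0 speedup out of 4 processors.
    print(smt_box_speedup(2, 0.30, 3.0))
```

Changing speedup_all_threads shows how strongly the result depends on the
parallel efficiency rather than on the raw NPS gain.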