Computer Chess Club Archives


Subject: Re: DTS article robert hyatt - revealing his bad math

Author: Robert Hyatt

Date: 12:51:17 09/03/02


On September 03, 2002 at 11:56:48, Vincent Diepeveen wrote:

>We all know how many failures parallel programs developed by scientists have
>been over the past years. This year's DIEP showing at the Teras was no
>exception. I had only 3 days of preparation time to get onto the machine (and
>up to 5 days before the tournament I wasn't sure whether I would get system
>time *anyway*).
>
>However, sponsors want to hear how well your thing did. On a 1024-processor
>machine (maximum allocation 512 processors within 1 partition of shared
>memory), of which you get 60, with memory bandwidth 2 times slower than local
>RAM, and let's not even *start* to discuss the latency, otherwise you will
>really start to fear for DIEP using that machine. All I can say about it is
>that the 20-times-slowed-down Zugzwang ran in 1999 on a machine with better
>latency...
>
>I'm working hard now to get a DIEP DTS NUMA version ready.
>
>DTS it is, because it splits dynamically wherever it wants to.
>
>Over a month of full-time work has been done now. Tests on a dual K7 as well
>as on dual supercomputer processors have been very positive.
>
>Nevertheless I worried about how to report on it. So I checked out the
>article from Robert Hyatt again. Already in 1999, when I had implemented a
>PC-DTS version, I wondered why I never got near Bob's speedups when I was not
>forward pruning with anything other than nullmove. With the 1999 world
>championship version I had great speedups, but I could explain them all by
>the forward pruning I was using at the time.


Vincent, that is _utter_ hogwash.  In 1999 you hadn't done anything remotely
like DTS.  And, in fact, 6 months ago you asked me for additional details on
how I chose to split, which further suggests that "your dts" is not the same
as "my dts"...




>
>I never got close, even on a dual Xeon or quad Xeon, to the speedups reported
>by Bob for his DTS version described in 1997. I concluded that it had to do
>with a number of things, encouraged by Bob's statements. In '99 Bob explained
>that splitting was very cheap on the Cray. He copied a block with all 64KB of
>data from processor 0 to P1 within 1 clock on the Cray.

Never said that.  I said that once you start a copy on the cray, after the
memory latency period, it moves 16 bytes per clock cycle, but you can start
using the data after the first 16 bytes are clocked over, because of how
vectors work on that machine...


>
>I didn't know much about Crays or supercomputers at the time, except that
>they were out of my budget, so I believed it. However I have a good memory
>for certain numbers, so I have remembered his statement very well.


:)

What more can I say?  You might "remember" but you _sure_ don't understand
them.  I've tried twice to explain how we did mobility, which you said couldn't
be done...  So there is little point...




>
>In 2002 Bob explained the Cray could copy 16 bytes each clock. A BIG
>contradiction of his 1999 statement. No one here will be surprised by that,
>because regarding Deep Blue we have already seen hundreds of contradictory
>statements from Bob. Anyway, that of course makes splitting on the Cray
>very expensive, considering Bob copied 64KB of data for each split.

64 KB = 4K 16-byte chunks.  At 4ns per chunk, that is 16K nanoseconds,
or 16 microseconds.  You think that is too long?  When you don't do
but a few thousand splits per move?

Your math sucks badly...  as always...

And I am sure you will now tell me how it can't be done in 16 microseconds,
after which I will refer you to the folks up at Cray Research and let them
explain to you how it _can_...
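
If you would rather check that arithmetic than argue about it, here is a
minimal C sketch of the same calculation.  The 4ns clock and the 16 bytes
per clock are the numbers above; the 3000 splits per move is just an
assumed round figure for illustration.

#include <stdio.h>

int main(void) {
    const double ns_per_clock    = 4.0;            /* Cray clock period in ns       */
    const double bytes_per_clock = 16.0;           /* vector copy rate, bytes/clock */
    const double bytes_per_split = 64.0 * 1024.0;  /* 64 KB copied per split        */
    const double splits_per_move = 3000.0;         /* assumed, for illustration     */

    double clocks       = bytes_per_split / bytes_per_clock;   /* 4096 clocks  */
    double us_per_split = clocks * ns_per_clock / 1000.0;      /* ~16.4 usec   */
    double ms_per_move  = us_per_split * splits_per_move / 1000.0;

    printf("copy time per split: %.1f microseconds\n", us_per_split);
    printf("copy time per move : %.1f milliseconds (%.0f splits)\n",
           ms_per_move, splits_per_move);
    return 0;
}

Even a few thousand splits per move works out to a few tens of milliseconds
of copying, which is noise in a search that runs for minutes.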





> Crafty is no exception here.

What does this have to do with anything???  Crafty can easily be tested by
anyone on the planet to see how it performs...  It doesn't copy as much data
as cray blitz, probably averaging maybe 4K-8K bytes per copy, with a max
of something like 50K or whatever number I gave you when I added up the
entire size of the tree data structure.  Of course, that would be another
interesting bit of analysis on your silly statements.

You told GCP "you should do this efficiently like bob does in Crafty and
not copy so much data."  I told you "Vincent, I copy up to XXK depending
on the depth of the split point."  GCP laughed.  You then started "your
split overhead is horrible, no wonder you can't do ..." and off you went
into the wild blue yonder...





>
>I never believed the 2.0 speedup for 2 processors in his table on page 16,
>because when I do a similar test I also sometimes get > 2.0, but usually less.

If you look at the table, it seems pretty reasonable.  I don't get many
speedups of >2.0, period.  And I didn't get _any_ on that test.  The
speedups bounced from 1.7 to 2.0, rounded to one decimal place...


>
>Singular extensions hurt DIEP's speedup incredibly, but even today, searching
>for a few minutes, I cannot get to the speedup Bob achieved in his 1997
>article.


So that means I published fake data?  Because _you_ can't reproduce it?

:)




>
>In 1999 I wondered why his speedup was so good.
>When I asked, Bob concluded that he split in a smarter way.
>So obviously I asked how he split in Cray Blitz, because
>what Bob is doing in Crafty is too horrible for DIEP to get a speedup
>much above 1.5 anyway.
>
>The answer was: "do some statistical analysis yourself on game trees
>to find a way to split well, it can't be hard; I could do it too in
>Cray Blitz, but my source code is gone. No one has it anymore".

I believe the DTS article explains where I chose to split quite clearly.
If you can't follow, that is a problem with your comprehension, or a problem
with my explanation.

I wrote exactly what I did, however...

>
>So you can imagine my surprise when, after 1999, he suddenly had data of
>Crafty versus Cray Blitz, which Bob quotes on CCC to this day to prove how
>good his thing was.


Do you understand the difference between "source code" and "executable code"???

I have more than one executable.  I lost almost everything around 1996-1997.
I don't recall exactly when.  But it would be easy to go back to the old
crafty mailing list stuff to see when I sent out a request for old versions
because they had all been lost here due to a disk failure and a year's worth of
backup tapes that were not readable.  I believe it was even posted on r.g.c.c
as well...

I wasn't happy.  I still have not found all the old source versions, some of
which were not even released.  But that's the way it goes...




>
>Anyway, I can analyze games as an FM, so I already knew a bit about how good
>this Cray Blitz was. I never paid much attention to Bob's lies here.

<lies>???

>
>I thought he was doing this in order to save himself time digging up old
>source code.

No.  I just didn't have it.  I generally could dig up an old executable as
they were distributed in the Cray user's group library tapes.  But not
source...


>
>Now, after a month of full-time work on DIEP on the supercomputer, and having
>it working great on a dual (with very little overhead) but still with a bad
>speedup, I started worrying about my speedup and about the future article to
>write about it.
>
>So a possible explanation for the bad speedup of today's software, when
>compared to Bob's program of 1993 (written about in 1997), is perhaps
>nullmove. Bob still denies this despite a lot of statistical data at loads
>of positions (150 positions tried in total), even with CRAFTY.

OK... let's take your credibility to task here.

1.  You were asked by some sponsor to provide a faster speedup than anything
yet produced.  You told them "I can do that easily, I am the best chess
programmer on the planet."

2.  You failed to do so.

3.  You approached me and said you couldn't do it, and that it must be
for one of two reasons:

  (a) null-move makes it harder to get a good speedup today.  I responded
      by pointing out that Cray Blitz used null move, although R=1 and non-
      recursive.  I then said "null-move does not affect the speedup
      significantly."  You wouldn't have any of it.  You asked me to run the
      test using 4 cpus and selective 0/0 (no null move) vs selective 2/3
      (normal null-move in crafty).  I did.  Non-null-move produced a speedup
      of 3.1 on the Cray Blitz test set you insisted I use.  Null-move produced
      a speedup of 3.0.  GCP ran the same test and got 2.9 and 2.8.  So you
      can't use null-move as an argument because you will be laughed out of
      town since anybody can test your claim and prove it is garbage.

  (b)  I had faked the Cray Blitz data to make it look better than it really
      was.  My original DTS dissertation produced a 16 cpu speedup of about
      9 searching at .7 MIPS, to a depth of only 9 plies, which is all I could
      manage on that hardware.  I later ran the test in a totally different way,
      so that there was continuity between the moves by playing the game and
      pondering normally, and the speedup improved to 11, not considering that
      the searches were far deeper on the C90 as well.

You were insistent on finding _some_ excuse as to why a NUMA machine can't
approach the speed of the Cray.  I pointed out that you would have absolutely
no credibility with anyone by even trying to prove that since nobody in their
right mind would expect a NUMA machine to compete with a $60,000,000 super-
computer.  But you were not to be deterred.  You _must_ either get your speedup
better, or else claim mine were false...

Wonderful reasoning...





>
>Bob doesn't find those results significant. Also he says that not a
>single one of MY tests is valid because I have a stupid PC with 2 processors
>and bad RAM; a dual would hurt Crafty's performance too much.

Vincent, if I had any idea what your problem was, I would try to help.
I think it amusing that you have asked me for machine time in the past,
which I happily provided to you.  You asked me for programming help which
I happily provided.  And then you pull this sort of nonsense.  Don't expect
any _further_ help, of course.

You directly said in an email to me (which I will be happy to post here if
you want to deny it): "Crafty gets no speedup at all on these Cray Blitz
test positions, on my dual."

I said bull and ran them and gave you the log.  I think GCP ran them on
the quad I am loaning him and he _also_ got a speedup within .2 of what
I sent you.

So whose test is flawed?

Whose test is _always_ flawed???






>
>This because I also concluded that the speedup Crafty gets here
>is between 1.01 and 1.6, not 1.7.


Conclude what you want.  I tend to like to _prove_ what I post,
regarding Crafty.  I posted yet another example of your nonsense
this morning about the SMP vs non-SMP version.  Where you get your
data is a mystery to me...




>
>Data suggests that Crafty's speedup on his own quad is about 2.8,
>where he claims 3.1.

Before you go on, GCP produced 2.8 on one run.  I produced 3.0 on the
same test set (on a faster machine).  Is that .2 difference _really_
driving you nuts here?  Do you see just how small a difference that
really is???

And do you agree that 2.8 is _far_ better than what you suggest I can
get with your wild rambling nonsense???

I don't claim Crafty gets any _specific_ speedup at all.  If you look
at my answers to questions about that, I always say "as a general rule,
the formula is speedup = 1 + (.7 * (NCPUS-1))".  "as a general rule".
Not "in every case".  The speedup varies significantly from one position
to another, and on the same position from one run to another.  This is about
statistical averages, nothing more.
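
To make the "general rule" concrete, here is a trivial sketch of that
formula; the cpu counts are just example values:

#include <stdio.h>

/* Rule-of-thumb speedup quoted above: 1 + 0.7 * (NCPUS - 1).
 * It is a statistical average over many positions, not a guarantee
 * for any single run. */
static double estimated_speedup(int ncpus) {
    return 1.0 + 0.7 * (ncpus - 1);
}

int main(void) {
    int cpus[] = { 1, 2, 4, 8, 16 };
    int i;
    for (i = 0; i < 5; i++)
        printf("%2d cpus -> estimated speedup %.1f\n",
               cpus[i], estimated_speedup(cpus[i]));
    return 0;
}

It prints 1.7 for 2 cpus and 3.1 for 4 cpus, which are exactly the averages
being argued about here; individual positions land above or below that.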

One day you might eventually "get it."





>
>Then Bob referred back to his 1997 article, saying that the test method
>wasn't good, because to get that 2.8 we used cleared hashtables, and in his
>article he cheats a little by not clearing the tables at all. To simulate a
>game-playing environment that's OK of course.
>
>However there is a small problem with his article. The search times and
>speedup numbers are complete fraud. If I divide the 1-cpu times by the
>speedup Bob claims he has, I get nearly perfect numbers.

And should you not get that?  I computed the speedup by dividing the
time for one cpu by the time for N cpus.  So of course if you go in the
opposite direction you should get the original data.  :)

jeez...
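
If that is not obvious, a two-line sketch shows the round trip; the times
below are made up purely for illustration:

#include <stdio.h>

int main(void) {
    /* The times are invented, purely to illustrate the round trip. */
    double t1 = 600.0;           /* hypothetical 1-cpu time, seconds  */
    double tN = 55.0;            /* hypothetical 16-cpu time, seconds */
    double speedup = t1 / tN;    /* speedup as defined: t1 / tN       */

    printf("speedup      = %.2f\n", speedup);
    printf("t1 / speedup = %.2f  (recovers tN, by construction)\n",
           t1 / speedup);
    return 0;
}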




>
>Here is the result for the first 10 positions, based upon Bob's article of
>March 1997 in ICCA Journal issue #1 of that year; the tables with the results
>are on page 16:
>
>When DIEP searches a position, the speedup is always a weird number.
>If I claim a speedup of 1.8, then it is usually 1.7653 or 1.7920 or 1.8402
>and so on. Not with Bob. Bob knows nothing about statistical analysis
>of data (I must plead ignorance here too, but at least I am not STUPID
>like Bob here):
>
>pos   2      4      8   16
>1  2.0000 3.40   6.50   9.09
>2  2.00   3.60   6.50  10.39
>3  2.0000 3.70   7.01  13.69
>4  2.0000 3.90   6.61  11.09
>5  2.0000 3.6000 6.51   8.98876
>6  2.0000 3.70   6.40   9.50000
>7  1.90   3.60   6.91  10.096
>8  2.000  3.700  7.00  10.6985
>9  2.0000 3.60   6.20   9.8994975 = 9.90
>10 2.000  3.80   7.300 13.000000000000000
>
>This clearly PROVES that he has cheated completely about all the
>search times from 1 processor to 8 processors. Of course,
>now that I am running on supercomputers myself, I know what the
>problem is. A month ago I only needed a 30-minute look
>to see what the problem is in Crafty, and most likely that was
>the problem in Cray Blitz as well. The problem is that Crafty
>copies 44KB of data or so (Cray Blitz 64KB), and while doing that
>it is holding smp_lock. That's too costly with more than 2 cpus.

How does that "prove" anything?

And what does "smp-lock" have to do with anything?  I told you, it is
_not_ a problem in Crafty.  I _know_.  Just because you can't do it right
doesn't mean I can't...





>
>This shows he completely lied about his speedups. All the times
>from 1-8 cpus are complete fraud.
>
>There is, however, also evidence that he didn't compare the same
>versions. The Cray Blitz node counts are also weird.


I used _exactly_ the same version for all tests.  So again, your
rambling leaves me wondering what kind of medication you are on and
how you have screwed up the dosage...




>
>The more processors you use, the more overhead you have, obviously.
>Please don't get mad at me for calculating it in the following simple
>but very convincing way. I will do it only for his first position's node
>counts at 1..16 cpus; the formula is:
>  (nodes at i cpus / speedup at i cpus) * speedup at i+1 cpus
>
>From 1 to 2 cpus we don't need the math.
>If you need exactly 2 times less time to get there, but
>thereby you need more nodes with more cpus (where you need
>expensive splits), then that's already weird of course, though
>not impossible.


Not weird to me.  Using one extra cpu leaves plenty of room to find really
good split points.  The more cpus there are, the more split points there must
be, and the greater the odds of bad ones being chosen.  That's the reason
that later tests on 32 cpus were even worse...




>
>From 2 to 4 cpus:
>  3.4 * (89052012 / 2.0) = 151,388,420.4 nodes.
>  Bob needed: 105,025,123, which in itself is possible.
>  Simply something like 40% extra overhead for 4 processors that 2 do
>  not have. This is very well possible.
>
>From 4 to 8 cpus:
>  6.5 * (105025123 / 3.4) = 200,783,323
>  Bob needed: 109 million nodes.
>  That means at 8 cpus the overhead is already rapidly approaching
>  100%. This is very well possible. The more cpus,
>  the bigger the overhead.
>
>From 8 to 16 cpus:
>  9.1 * (109467495 / 6.5) = 153,254,493
>  Bob needed: 155,514,410
>
>My dear fellow programmers: this is impossible.



I have absolutely no idea what you are talking about.  The node totals I
published in that paper are simply raw node counts from the program.  The
first 1,2,4,8 cpu tests didn't have a _lot_ of overhead.  At 16 the overhead
jumped.  So what?  Want me to run a few tests in Crafty?  I have seen that
more than once.  2 processors = 1.8 speedup, 4 = 1.9.  So what?  I would have
liked to not see that overhead creep in, but it did, and I reported it, and it
hurt the results.

Again, so what?
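
For anyone who would rather check the quoted arithmetic than argue about it,
here is a minimal sketch that simply re-runs the calculation as quoted above,
using only the node counts and speedups that appear in the quote.  It
reproduces the quoted predictions; it says nothing about whether the formula
behind them makes any sense.

#include <stdio.h>

int main(void) {
    /* Node counts and speedups exactly as they appear in the quote. */
    double nodes[]   = { 89052012.0, 105025123.0, 109467495.0, 155514410.0 };
    double speedup[] = { 2.0, 3.4, 6.5, 9.1 };
    int    cpus[]    = { 2, 4, 8, 16 };
    int    i;

    for (i = 0; i + 1 < 4; i++) {
        double predicted = speedup[i + 1] * (nodes[i] / speedup[i]);
        printf("%2d -> %2d cpus: predicted %12.0f nodes, quoted %12.0f\n",
               cpus[i], cpus[i + 1], predicted, nodes[i + 1]);
    }
    return 0;
}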


>
>Where is the overhead?
>
>The at-least-100% overhead?
>
>More likely a factor of 3 in overhead.


What overhead are you talking about?  The _only_ overhead I had in
Cray Blitz was search overhead (extra nodes).  Just like that is all
I have in current Crafty.  I lose about 2-3% of 1 cpu in a 3 minute search
to spinlocks and memory conflicts.  The rest of the machine (3.97 processors)
is _busy_ searching.  Unfortunately, they are searching stuff that is not
always important.  But they are _always_ searching _something_...


>
>The only explanation I can come up with is that the node counts
>for 2..8 processors were produced by a different version of
>Cray Blitz than the 16-processor version.

Conclude anything you like, it doesn't make it true.  Just more of the
same old garbage...


>
>For the single-cpu version we already know the node count has got to be
>weird, because it is using a smaller hashtable (see section 4.1 of the
>article, the second line there after 'testing methodology').


Better read it again.  The same size hash was used everywhere.  But in the
one processor tests, there were more memory conflicts because _others_ were
using the machine.  I'll wait for you to show me exactly where I said I used
a smaller hash table, but I won't hold my breath.  I will include the paragraph
for those that don't have the JICCA article:

==============================================================================
The testing methodology was to take the 16-processor log produced during the
actual game, and then "contrive" things so that the "lesser" configurations
would do the same amount of work (roughly).  In these tests, if the 16
processor search reached 11 plies and searched the first 10 root moves
before the search timed out, then all lesser configurations had to also
do exactly this same search, which led to some embarrassingly long search
times for one processor as will be seen.  One other minor note is that the
single processor times appear to search slightly slower than the equivalent
parallel searches, which would seem to be counter-intuitive.  However, due
to the enormous time required to play through this game without stopping
(which would have cleared the transposition table and so forth) we ran
this test on a production machine, and competed with other processes.  As a
result, memory conflicts were much higher (bank conflicts to those that are
Cray-savvy) as well as swapping overhead which gets charged to the user.
While this is well below the .1% level of noise, it is noticeable, and should
be remembered.
================================================================================


Nowhere that I can see does it say "smaller hash".  I included that because it
certainly
was true that the one processor tests were slightly _slower_ than normal, due
to memory conflicts caused by other users running other stuff at the same time.
I had no way to measure this for the entire run, so you could (if you choose)
subtract some time from the one-cpu tests to slightly lower the overall
speedup values.  But I did several tests just to measure this and it didn't seem
to be significant, hence the .1% comment at the end.  No attempt to be
dishonest, but I guess I failed to write it in small enough words so that you
could follow what it meant...


>
>We are talking about mass fraud here.
>
>Of course this article is 5 years old and I do not know whether
>he created the table in 1993.

The game was played in 1993.  The results were produced over the next
year.  The paper took some time and then it took about a year of haggling
with the ICCA and the referees, a couple of whom looked at the raw CB
output to compare it to the data in the article, which is pretty common in
the review process...  I don't remember exactly when it was finally published,
but the review process was definitely slowed down by both the reviewers asking
questions and wanting clarifications in some of the wording, as well as my
having to find time to do what they wanted...





>
>How am I going to tell my sponsor that my speedup won't be the same
>as the one from the 1997 article? To what do I compare, Zugzwang? It
>'only' had, on paper, a 50% speedup out of 512 processors. Of course that
>is also not realistic. However, Feldmann documented most of
>the things he did in order to cripple Zugzwang to get a better speedup.
>
>A well-known trick is to kick out nullmove and only use plain alphabeta
>instead of PVS or other forms of search. Even Deep Blue did that :)
>
>But what do you guys think of this alternative bookkeeping from Bob?
>


I don't think it is "bob" that has the problem here.  You have shown that
you have absolutely no moral value, because rather than tell your "sponsor"
why you can't directly compete with a supercomputer (because you don't believe
they will like it) you want to do more of the "diep proof" approach to things
and forge ahead.  Note that any attempt to show that a NUMA machine will perform
better than a C90 is going to get you laughed out of any discussion you care to
join and bring that up.  But then again, you don't seem to care about fact
and reason anyway...  so...




>Best regards,
>Vincent


