Author: Robert Hyatt
Date: 10:12:04 05/02/04
Here is a concise explanation of the data as presented in the DTS article:
1. The original idea behind this paper was to answer the question "OK,
we know what your speedup numbers look like for test sets (at that time,
everyone was using the 24-position BK test set for their parallel speedup
analysis), but how does it do in real games?"
To answer this, I chose a real game (Cray Blitz vs. Mchess Pro, played
at the last ACM event) and tried to see how the speedup looked. This was
different from previous testing because this approach had "continuity"
between positions, provided by the hash table that carries information
from one search to the next, something that is not possible when testing
over unrelated positions such as the BK (or any other) set of positions.
2. This process took an incredible amount of computer time, because the
data I started with came from a real game played on a 16-cpu Cray C90
that we used at the ACM event that year. Since the 16-cpu searches were
often very long, the 1-cpu tests were beyond long. In fact, it often took
a day or two to run one 1-cpu test position. Each position had to be run
twice (explained below), which turned this into a long process.
For each position, already having the 16-cpu search time (in whole seconds
only), I would run the 1-cpu tests; after each search was done, I computed
the speedup by hand and wrote it down in a log I was creating (a 1-page
log, with columns for 1, 2, 4, 8 and 16 cpus across the top, and position
numbers given vertically down the side). Once this was completed, I had
the speedup numbers I wanted from a real game (note that I had previously
reported BK numbers in my Ph.D. dissertation).
3. I then wrote the paper and sent it to the JICCA. The first request,
from one of the referees (whom I don't know), was for times and node
counts. I was immediately faced with a dilemma on two counts. First, as
most by now know, a disk crash destroyed every file I had at the time.
None of the backup tapes were readable, as has been previously mentioned.
However, before I discovered that, an immediate problem was already
obvious, and I didn't have any particularly good options. The problem was
node counts. Here is some sample Crafty output to illustrate the problem:
12-> 4.22 0.13 22. d5 Qd7 23. Bd2 a5 24. Bc3 a4 25.
Bxb4 cxb4 26. bxa4 bxa4 27. e5 Rac8
28. e6 fxe6 29. dxe6 (s=6)
13 5.41 0.17 22. d5 Qd7 23. Bd2 a5 24. Bc3 Na2 25.
e5 a4 26. e6 fxe6 27. dxe6 Qe8 28.
Nd2 d5 (s=5)
13-> 11.47 0.17 22. d5 Qd7 23. Bd2 a5 24. Bc3 Na2 25.
e5 a4 26. e6 fxe6 27. dxe6 Qe8 28.
Nd2 d5 (s=5)
14 20.09 0.13 22. d5 Qd7 23. Bd2 a5 24. Nf5 Bf8 25.
N3h4 g6 26. Nh6+ Bxh6 27. Bxh6 Re8
28. Qf3 Nc2 29. Re2 Nd4 (s=4)
Black(21): time 239800
time remaining: 39:58 (crafty).
Black(21): otim 206200
time remaining: 34:22 (opponent).
Black(21): Qe6
predicted move made.
time=1:20 cpu=398% mat=0 n=604895935 fh=90% nps=7.51M
ext-> chk=11081753 cap=2151421 pp=331649 1rep=1422592 mate=73081
predicted=18 nodes=604895935 evals=215912705
endgame tablebase-> probes=0 hits=0
SMP-> split=8749 stop=1140 data=11/128 cpu=5:20 elap=1:20
Looking at that, how does one come up with a "nodes" number? The 14-ply
search was not completed. All I know is that the 14-ply PV popped out at
20 seconds, and after 80 seconds Crafty had searched almost 605M nodes.
Even worse, I wanted to accurately determine how long the 1-cpu search
would take to do the same amount of work and leave the hash table
pre-loaded for the next search, just as it was after the 16-cpu test. I
had to run the 1-cpu test to see how long it took to produce the 14-ply
PV in the above data, which gave me enough data to compute search
overhead. I could then compute how many nodes the search needed to
examine to be sure the hash table was properly loaded.
For example, in the above, the 4-cpu PV took 20 seconds. Suppose the
1-cpu PV takes 60 seconds. The speedup is 3X, and the overhead is 25%,
since one of the four CPUs was effectively doing useless work. I would
then go back and re-run the test on 1 cpu, but this time tell it to
search 3/4 of the total nodes searched by the 4-cpu test, which took even
longer. I wrote this hash table to disk so that when I started the next
search I had the right contents to read back in and initialize the table
correctly.
This ended up taking almost a _year_ to do. When I finished, I had a
sheet of notebook paper with the speedup matrix, and hundreds of log
files: one big one from the actual game, and at least two per position
for each processor count. Those logs had the times, which I used to
compute my speedup numbers, as well as the rather useless node counts
that had nothing to do with the search at the instant the PV was
produced. And this is the data that was lost.
When asked for the node counts by one of the referees, the only viable
choice was to compute them, since I had never recorded node counts at the
instant each PV was produced. I ran some tests and discovered that it was
pretty easy to predict them accurately, because the DTS algorithm had
very little "idle time" and the cpus were busy searching essentially 100%
of the time. I.e., in the above example, multiply the 605M by .25 (the PV
came out 1/4 of the way through the total search) and that number was
very accurate. Not in the last couple of digits, of course, but accurate
enough.
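That reconstruction can be written out explicitly. A sketch under the same assumption stated above (cpus busy nearly 100% of the time, so nodes accrue roughly linearly with time); the function name is mine:

```python
def estimate_pv_nodes(total_nodes, pv_time, total_time):
    """Estimate the node count at the moment a PV was printed, assuming
    nodes accrue linearly with elapsed time (little or no idle time)."""
    return int(total_nodes * (pv_time / total_time))

# Example from the log above: the 14-ply PV appeared at 20 seconds of an
# 80-second search that totaled 604,895,935 nodes.
est = estimate_pv_nodes(604895935, 20, 80)   # about 151M nodes
```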
So that is how the data was "computed". Speedup came _directly_ from
taking the time for 1 cpu and dividing by the time for N cpus, where N
was 2, 4, 8 and 16. Speedups were _perfectly_ computed. Node counts were
later derived from the speedups, since there was no other way to compute
them. Times were computed the same way, since at the point the data was
requested, all I had was the hard-copy log from the 16-cpu game output,
while all the other data had been kept electronically and was lost.
If you think about it, time and speedup are redundant. If you know one,
you can accurately compute the other. That is why the original paper only
reported speedup.
4. A couple of years later, Vincent got interested in DTS and started to
ask me about how to do this or that. Then he told me about his chance to
use a big machine, before he knew how to spell NUMA. I pointed out that
the box he was looking at was NUMA and would cause him some difficulties.
He charged ahead and then discovered that his speedup numbers were not
that good, no surprise on a large number of CPUs on a NUMA box (the Cray
C90 is fully SMP shared memory, with memory bandwidth that blows away
NUMA boxes completely).
He sent me an email trying to justify his poor performance. He first
claimed that it was an artifact of null-move. Testing disproved that. He
then decided that my data was bogus because the node counts and times
could be derived from the speedup numbers. I explained why, but that did
not deter him from trying to make his search look as good as possible,
regardless of anything or anybody else.
5. As I have said repeatedly, the speedup numbers were derived from raw
data. They vary all over the place. The times and nodes track the speedup
numbers very accurately, since they were derived from them. However,
where I reported a time of, say, 425 seconds, the actual log file would
only have been somewhere close, because the reported time was
reconstructed as 16CPUTime * speedup, with the speedup computed to 1
decimal place.
The node counts were _never_ exact. I had (and still have) no way to
produce a node count each time I display a PV. Vincent has repeatedly
modified Crafty to do this, and I have repeatedly pointed out that it
doesn't work, because I don't have a single global node counter; each
thread has a node counter for each split point. There is no way to "add
'em up" without bringing the entire search to a complete halt at a
synchronization point, something that is pointless from an efficiency
point of view. Of course, he can keep modifying the source without
knowing a thing about how it works, but the numbers he shows will always
be wrong (the way he does it, the numbers will _always_ be less than the
true node count).
But that won't stop him.
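The counter issue can be shown with a toy sketch. This is my own simplification, not Crafty's actual code: each thread keeps a private counter to avoid contention, so an exact total only exists after every thread has stopped at a synchronization point; any sum taken mid-search is a stale undercount.

```python
import threading

NTHREADS = 4
NODES_EACH = 100_000

# One private counter per thread -- no shared counter to contend on,
# loosely mirroring the per-thread/per-split-point counters described
# above.
counters = [0] * NTHREADS

def worker(tid):
    for _ in range(NODES_EACH):
        counters[tid] += 1   # thread-local bookkeeping only

threads = [threading.Thread(target=worker, args=(t,))
           for t in range(NTHREADS)]
for t in threads:
    t.start()
# sum(counters) *here*, mid-search, would be a stale undercount.
for t in threads:
    t.join()                 # the synchronization point

total = sum(counters)        # exact only after all threads have stopped
```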
6. I explained all of this at the front of the paper. But as things
progressed, the editor kept trying to shorten the paper (a pretty common
thing in journals) to save space. We ended up cutting most of the
introduction, which would have made the derived data clear to all. A
mistake, but it really doesn't affect the final results one iota.
7. If you choose to call that fraud, fine by me. I stand by the speedup
numbers. I stand behind the testing methodology, as it answered a
question I had been asked repeatedly and never had an answer for. Of
course, in the case of Vincent, we've had these arguments before. He
always disputes my results. He always says "post the logs and everyone
will see you are wrong." I post them, and he runs and hides when they
don't support his position.
A classic example is based on my rough speedup estimate of
speedup = 1 + (NCPUS - 1) * .7
On a quad, that says roughly 3.1X faster. He has, more than once, posted
results on a few positions and said "Aha, crafty only got 3.0 here, not
the 3.1 you lie about." :) Perhaps .1 is a big thing to him. To me it is
noise, as a deviation of .1 is pretty small after working on parallel
search for 20+ years now.
Of course, he has also posted that on a dual, crafty gets _zero_ speedup,
only to get data from lots of crafty users with duals showing that he had
somehow managed to produce bogus data. Whether he made it up or not I
won't try to divine.
If you want to ignore the paper/data, by all means do so. Of course you
can't ignore Vincent's paper/data because he hasn't written one, and he
probably never will. And after some of his claims in recent years, I
doubt any technically competent person would take it seriously anyway.
YMMV of course.
Vincent reminds me of the old IBM salesman jokes we told in the 1970s,
when IBM was known to promise great things but not deliver:
"Newlywed calls her mom at 4am on the honeymoon night and says 'Mom, come
pick me up, this is horrible.' Mom says 'Honey, what is wrong, sex not
going like you expected? You can expect him to be a bit demanding the
first year or so and over-do it... your father did.' The daughter
responds 'Too much sex? Mom, he hasn't touched me. He's spent all night
telling me how great it is going to be...'"
Sounds like that joke was about Vincent. He's been telling us how great
Diep is going to be for years... and about how bad Tiger, Fritz, Rebel,
Hiarcs, etc. are. But he just can't seem to beat them... Neither can I,
but I don't make the wild claims/promises/statements...
Draw your own conclusions about the DTS paper...