Computer Chess Club Archives


Subject: Re: New crap statement ? Perpetuum mobile

Author: Robert Hyatt

Date: 12:26:44 10/05/01



On October 05, 2001 at 14:08:18, Miguel A. Ballicora wrote:

>On October 05, 2001 at 12:28:34, Bruce Moreland wrote:
>
>>On October 04, 2001 at 20:14:30, Robert Hyatt wrote:
>>
>>>On October 04, 2001 at 18:41:14, Bruce Moreland wrote:
>>
>>>>Two units of work is more than twice as fast as six units, so this particular
>>>>case is super-linear, barely.  You can make it more so by making the arrays
>>>>longer.
>>>
>>>
>>>Nope.  I modify my serial algorithm to match your parallel algorithm.  Then it
>>>is not super-linear at all.  I find it after two compares, one on the first
>>>element, one on the last.  The parallel algorithm finds it after two compares
>>>also, but they are done in parallel in one cycle.  Speedup = 2.0...
>>
>>Bob, I would be an idiot if I said that you couldn't get the same speedup by
>>modifying the serial algorithm.  I'm not saying that.  I'm saying that you might
>>be able to see improbable speedups when modifying a serial algorithm by making
>>it run in parallel.
>>
>>I'd be a fool to argue that a parallel algorithm will beat a serial algorithm
>>plus a guy with a compiler and some time to modify code.
>>
>>If you are now saying that a super-linear speedup could be possible, but that a
>>guy with a compiler could fix the serial algorithm to make it run better, we
>>aren't arguing about anything.
>
>There are several levels of discussion here and there were jumps from one level
>to another. That made the discussion impossible to follow.
>
>Level 1) Given a problem and a sequential algorithm (accepted as good), once the
>algorithm is "parallelized" it is possible to observe a constant superlinear
>speedup. "Parallelized" means to adapt the sequential algorithm so it will work
>in parallel (two threads with double the resources). Of course, technically it
>is now a different algorithm, but let's say that the changes were kept to a
>minimum so as not to break the spirit of the original sequential algorithm.

That was the original premise...
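
For concreteness, here is a minimal sketch of the kind of toy example being
argued about.  The array contents, the key position, and the forward/backward
split are my own illustrative choices, not necessarily Bruce's exact case: a
naive left-to-right serial scan against the same search "parallelized" into
two threads, each scanning from one end and signalling the other through a
shared stop flag.  Measured against the naive scan, the two-thread version can
look better than 2x, which is the apparent super-linear speedup under
discussion.

/* compile: cc -pthread example.c
   Hypothetical illustration of the Level 1 scenario.  One thread walks
   forward from the front, the other backward from the back, and a shared
   atomic flag tells both to stop once the key is found.  With the key at
   the last element, the naive serial scan does N compares while the
   finding thread does only one (actual counts vary with scheduling). */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 6
static const int a[N] = {10, 20, 30, 40, 50, 99};
static const int key = 99;
static atomic_int stop_flag;              /* shared between the two threads */

static int serial_scan(int *compares)     /* original sequential algorithm */
{
    for (int i = 0; i < N; i++) {
        (*compares)++;
        if (a[i] == key)
            return i;
    }
    return -1;
}

struct scan { int start, step, found, compares; };

static void *scan_thread(void *arg)       /* one half of the parallel version */
{
    struct scan *s = arg;
    s->found = -1;
    for (int i = s->start; i >= 0 && i < N; i += s->step) {
        if (atomic_load(&stop_flag))      /* the other thread already found it */
            break;
        s->compares++;
        if (a[i] == key) {
            s->found = i;
            atomic_store(&stop_flag, 1);
            break;
        }
    }
    return NULL;
}

int main(void)
{
    int compares = 0;
    printf("serial:   index %d in %d compares\n",
           serial_scan(&compares), compares);

    struct scan fwd = {0, +1, -1, 0};      /* scans 0, 1, 2, ...  */
    struct scan bwd = {N - 1, -1, -1, 0};  /* scans N-1, N-2, ... */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, scan_thread, &fwd);
    pthread_create(&t2, NULL, scan_thread, &bwd);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("parallel: index %d, compares per thread %d and %d\n",
           bwd.found >= 0 ? bwd.found : fwd.found,
           fwd.compares, bwd.compares);
    return 0;
}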


>
>Level 2a) Once this is observed, it is possible to change the original
>sequential algorithm to make it work better, learning from the parallel version.
>However, it is possible that this modification introduces new code whose
>overhead is worse than the overhead caused by doubling the resources. Even
>though the modified sequential version is better than the original, it is
>possible that it cannot completely remove the superlinear speedup, leaving a
>residual but measurable difference.


That last sentence is incorrect.  It is absolutely guaranteed that the super-
linear speedup will be removed, leaving _at best_ no super-linear speedup,
and more likely a deficit.  Running on a dual processor has inherent overhead
that a single processor does not: a shared bus, shared memory with conflicts,
and the necessity of atomic locks that can be avoided entirely in the
serial implementation of the parallel algorithm.
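
To make that concrete with the same toy search: the serial implementation of
the parallel algorithm simply interleaves the two threads' compares in one
process, and the shared flag, the locks, and the bus traffic all disappear.
A hedged sketch (again my own illustrative code, not anyone's real program):

/* Serial counterpart of the two-thread scan above: one process does one
   compare from the front and one from the back, alternating.  It matches
   the parallel algorithm compare-for-compare, yet needs no atomic "found"
   flag, no locks, and causes no bus or memory contention. */
static int interleaved_scan(const int *a, int n, int key)
{
    int front = 0, back = n - 1;
    while (front <= back) {
        if (a[front] == key) return front;   /* "thread 1" step */
        if (a[back]  == key) return back;    /* "thread 2" step */
        front++;
        back--;
    }
    return -1;
}

With the key at the last element it answers after two compares, which is why
the speedup of the two-cpu version over this serial version is 2.0, not more.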



>
>Level 2b)
>Even if rewriting the sequential version removes the superlinear speedup, making
>it 2.0, the rewriting could be extremely inconvenient, and in practice it is
>better to live with that, considering that Level 3 might solve the issue.


That is not an issue _at all_.  There is no limit placed on "how hard it might
be to rewrite."  For the case I have mentioned, it is trivial to search two
different parts of the tree using one kernel process.  I debugged like that
for years with Cray Blitz.
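
Not Cray Blitz code, but a minimal sketch of the idea I mean (names and the
"node counting" stand-in are purely illustrative): a single kernel process
advances two independent pieces of search one step at a time, round-robin, so
both halves of the tree make progress without a second processor.

/* Hypothetical sketch: emulate the two-processor split search inside a
   single process by alternating between two independent "searches". */
#include <stdio.h>

struct task { const char *name; int nodes_done, nodes_total; };

static void step(struct task *t)             /* stand-in for expanding one node */
{
    if (t->nodes_done < t->nodes_total)
        t->nodes_done++;
}

static void search_both(struct task *a, struct task *b)
{
    while (a->nodes_done < a->nodes_total || b->nodes_done < b->nodes_total) {
        step(a);                             /* "time-slice" in user code:   */
        step(b);                             /* alternate between the halves */
    }
}

int main(void)
{
    struct task left  = {"left half",  0, 1000};
    struct task right = {"right half", 0, 1500};
    search_both(&left, &right);
    printf("%s: %d nodes, %s: %d nodes\n",
           left.name, left.nodes_done, right.name, right.nodes_done);
    return 0;
}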





>
>Level 3) Since rewriting the sequential algorithm might not speed it up,
>or is very inconvenient, a suitable alternative could be to time-slice the
>CPU and run the parallelized version on one processor, making the speedup
>exactly 2.0. However, a balance of the overheads might make the speedup just
>above 2.0.


"might..."  "possibly..." those are not terms you use to prove something.

My statement is simply this:

  "for any algorithm you develop that runs more than 2x faster than the
   serial algorithm, I will write a new serial algorithm that is _also_
   faster than the original serial algorithm.  Enough faster that your
   parallel speedup when compared to my new algorithm will always be
   <= 2.0 for a 2 cpu machine."

Time-slicing does not hurt performance anywhere near as much as the atomic
locks that stifle bus traffic, nor the bus conflicts where two cpus try to
use one bus, nor the memory conflicts where two cpus want to read/write data
from one memory, etc.  I gave a very simple way to compare the two.  It will
reveal that the overhead of switching back and forth between two processes
is much less than the overhead caused by SMP interactions.

And note that my simple test case was a _worst_ case for context switching.
I suggested running a known program _twice_, using two "heavyweight" processes.
Each context switch has to save/restore registers, flush L1/L2 cache, flush the
TLB, etc.  Using lightweight threads, only the registers have to be saved and
restored.  So the actual performance penalty for the two threads in the
"emulated parallel algorithm" would be near zero.  The memory/bus contention
is _still_ going to stick at 7% and higher.  That 7% is for machines in the
500MHz speed range.  Double the clock and memory gets no faster, so the
conflicts go up.  I haven't tried it on anything beyond a dual 933MHz, which
got 11% overhead on this test.  A dual 2GHz would probably at least double
that.
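
For anyone who wants to repeat the measurement, here is a hedged sketch of the
kind of test described above.  It is not the exact program I used; the buffer
size and pass count are arbitrary, picked only so the working set will not fit
in L2.  It forks the same memory-walking workload as two heavyweight processes
and times the pair; run it once with both children confined to one CPU and
once with a CPU each (pinning is OS-specific), and compare both elapsed times
against twice the single-copy time.

/* Hypothetical measurement sketch for a Unix-like system. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define MEGS   64
#define PASSES 20

static void workload(void)
{
    size_t n = (size_t)MEGS * 1024 * 1024;
    unsigned char *buf = calloc(n, 1);
    unsigned long sum = 0;
    if (buf == NULL)
        return;
    for (int pass = 0; pass < PASSES; pass++)
        for (size_t i = 0; i < n; i += 64)   /* touch one byte per cache line */
            sum += buf[i]++;
    if (sum == 0)                            /* crude guard against dead-code removal */
        puts("unexpected");
    free(buf);
}

int main(void)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);

    for (int child = 0; child < 2; child++)  /* two "heavyweight" processes */
        if (fork() == 0) {
            workload();
            _exit(0);
        }
    wait(NULL);
    wait(NULL);

    gettimeofday(&t1, NULL);
    printf("elapsed: %.2f seconds\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
    return 0;
}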



>
>So, these are the levels of the discussion, and I believe I did not forget
>any. I believe that 1, 2a and 2b cannot be proven impossible.
>Regarding 3, Bob guarantees that the hardware of the future will never have
>those characteristics. I cannot read the future, so I won't discuss this point
>any further and will stick with 1 and 2 for now.
>
>Regards,
>Miguel
>


I am waiting for someone to give an algorithm example that supports either
1, 2a or 2b happening.  I happen to know they are impossible using today's
hardware, because I have measured the two kinds of overhead; it is an
important consideration in measuring parallel performance.  And once you know
that the "time-slicing" overhead is less on all existing machines we are
discussing, there is no real point in drifting off into "what if the hardware
could do this?..." type discussions.

Super-linear is simply a myth for the average case.  For the exceptional case,
it is unusual but not non-existent.  But in the average case, there will never
be any sort of super-linear speedup unless there is an intentional attempt to
take advantage of a hardware issue.  Today there is no real hardware issue that
can be taken advantage of without it being obvious, i.e. 2x the L2 cache, etc.
And _none_ of those issues apply to alpha/beta, as the tree is too large to
make that an issue.  Most chess programs won't even fit completely into L2,
much less their search space data.





>
>>
>>bruce


