Author: Robert Hyatt
Date: 08:40:11 12/19/02
For those remembering the discussion from a couple of weeks ago, I
had run into a strange problem with getting an inline asm lock to
work. I was playing with this because I was following the Intel
guideline of adding a "pause" to the "shadow-lock" part of the code.
First, the inline asm code:
void static __inline__ LockX86(volatile int *lock) {
  int dummy;
  asm __volatile__ (
        "1:    movl    $1, %0"   "\n\t"
        "      xchgl   (%1), %0" "\n\t"
        "      testl   %0, %0"   "\n\t"
        "      jz      3f"       "\n\t"
        "2:    pause"            "\n\t"
        "      movl    (%1), %0" "\n\t"
        "      testl   %0, %0"   "\n\t"
****    "      jnz     2b"       "\n\t"
        "      jmp     1b"       "\n\t"
        "3:"                     "\n\t"
        : "=&q" (dummy)
        : "q" (lock)
        : "cc");
}
First, the lock now works. The bug was on the **** line above, where I had
incorrectly written "jz" instead of "jnz". To explain the code first, read on...
This is based on the "shadow lock" approach to avoid frying the bus
when a processor is spinning. The _real_ lock must always be set/tested
with an atomic instruction, and the xchgl (exchange long) instruction
does this in an indivisible way. Unfortunately, it bypasses the cache
and runs out to memory to grab the old value and replace it with a
new one while the bus is locked.
Since I need to spin on that lock until it is zero (assuming it is already
set/held by another thread), looping on the xchgl instruction would _really_
interfere with the other processors that are doing useful work. In comes the
shadow lock.
If the xchgl instruction shows that the lock is already non-zero, I jump
to a loop that tests this value with a movl instruction which loops on the
value stored in cache. When another processor writes back to the lock
variable to set it to zero, my cache line for that word gets invalidated and
we reload it from memory and see the new zero contents. While I am looping
I don't execute an exchange instruction, just a simple move, which means my
processor is not accessing the memory bus at all, letting the other processors
run as fast as possible. When the move finds a zero value, control goes back to
the exchange instruction to do the test again atomically. If the exchange still
returns a zero, we own the lock and exit the loop; otherwise we hit the shadow
spin loop again and spin on cache.
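For those who would rather see the loop in plain C, here is a rough restatement
of the same idea. It uses GCC's __sync_lock_test_and_set() builtin (a much newer
compiler feature than anything the code above relies on) purely to make the
control flow obvious; on x86 it compiles down to the same xchg. The function
name is made up for illustration:

  static __inline__ void SpinLockSketch(volatile int *lock) {
    /* atomic exchange: store a 1 and get the old value back (bus-locked) */
    while (__sync_lock_test_and_set(lock, 1)) {
      /* shadow loop: plain reads that spin in cache until the lock clears */
      while (*lock)
        asm __volatile__("pause");
    }
  }

The unlock side is just the plain store of zero that triggers the cache-line
invalidation described above (something like *lock = 0; in a matching Unlock
routine).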
This is important because in the above code, the **** instruction used to be
"jz", which is wrong because it can cause an infinite loop. If the lock is zero
when the lock code is executed, the exchange and the test/jz that follows take
me straight out of the lock: we found a zero and the lock is now set to 1. But
if the lock is initially 1, I drop into the shadow spin loop, and with "jz" the
branch sense is backwards: I would keep looping if the lock had been cleared,
which is bad, or jump back to the exchange if the lock was still 1, which is
wrong as well. The point is that the loop would hang if I entered the code with
the lock already set to 1 and someone cleared it while I was between the
exchange and the shadow-lock loop.
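To make the branch-sense bug concrete, the only difference between the broken
and the working shadow loop is that one conditional jump (same labels and
operands as the code above):

  broken:
  2:  pause
      movl  (%1), %0
      testl %0, %0
      jz    2b        # keeps spinning while the lock word is 0 (already free!)
      jmp   1b        # and re-runs the bus-locked xchgl while it is still held

  fixed:
  2:  pause
      movl  (%1), %0
      testl %0, %0
      jnz   2b        # keeps spinning while the lock word is 1 (still held)
      jmp   1b        # lock looks free, so go back and retry the atomic xchgl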
And it did hang, although it actually played complete games on ICC without a
problem, and then it hung in three consecutive games and lost on time.
Why is that interesting? It suggests that for the most part, when this code
is called, the lock is zero, which is what I had thought all along with just
four threads running. This means that the _locks_ are not really affecting
my search speed, contrary to what "some" would like to suggest. If I were to
use 16 processors, I'm sure it would happen more often. But for now, the locks
do not appear to be a performance bottleneck.
BTW the above code works fine if you are running Linux, or using gcc/gas on
other platforms. It is not quite Microsoft syntax, as MS reverses the
operands to dst, src rather than the AT&T order of src, dst. There are also
other syntactic differences, such as [] vs () for memory operands, the $
prefix on constants, and so forth.
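As a quick illustration, here are the first two instructions of the lock
written both ways (the Intel/MS side is translated by hand, so double-check
against your assembler's documentation; eax/ecx are just example registers):

  AT&T (gcc/gas):                Intel/Microsoft:
    movl   $1, %eax                mov   eax, 1
    xchgl  (%ecx), %eax            xchg  [ecx], eax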
I think I am next going to work on making the rest of the asm work the same
way, inlining it to dump a lot of call-instruction overhead that is scattered
around...