Computer Chess Club Archives

Search

Terms
Messages

Subject: Re: GCC inline asm. Interesting notes, particularly about atomic locks.

Author: Robert Hyatt
Date: 21:36:02 12/21/02
On December 20, 2002 at 16:30:20, Dezhi Zhao wrote:

>On December 19, 2002 at 11:40:11, Robert Hyatt wrote:
>
>>For those remembering the discussion from a couple of weeks ago, I
>>had run into a strange problem with getting an inline asm lock to
>>work.  I was playing with this because I was following the Intel
>>guideline of adding a "pause" to the "shadow-lock" part of the code.
>>
>>First, the inline asm code:
>>
>>void static __inline__ LockX86(volatile int * lock) {
>>        int dummy;
>>        asm __volatile__ (
>>            "1:          movl    $1, %0"           "\n\t"
>>            "            xchgl   (%1), %0"         "\n\t"
>>            "            testl   %0, %0"           "\n\t"
>>            "            jz      3f"               "\n\t"
>>            "2:          pause"                    "\n\t"
>>            "            movl    (%1), %0"         "\n\t"
>>            "            testl   %0, %0"           "\n\t"
>> ****       "            jnz     2b"               "\n\t"
>>            "            jmp     1b"               "\n\t"
>>            "3:"                                   "\n\t"
>>            : "=&q" (dummy)
>>            : "q" (lock)
>>            : "cc");
>>}
>>
>
>Can we safely delete xchgl (%1), %0 instruction here? There is a similar example
>in the Intel spin lock application note. However the Intel example goes without
>an xchg instruction.
>

There are multiple ways to do it with the LOCK prefix.  The idea is that the
instruction has to fetch the old value _and_ change it in one indivisible
operation.  Some machines use test and set, some use fetch and add, some do
it in other ways...

Doesn't matter how, just that the operation is indivisible.


>>First, the lock now works.  The bug was on the **** line above.  I
>>had incorrectly written "jz".  To explain the code first, read on...
>>
>>This is based on the "shadow lock" approach to avoid frying the bus
>>when a processor is spinning.  The _real_ lock must always be set/tested
>>with an atomic-type instruction, and the xchgl (xchange long) instruction
>>does this in an indivisable way.  Unfortunately, it bypasses a cache hit
>>and runs out to memory to actually grab the old value and replace it with a
>>new value while the bus is locked.
>>
>>Since I need to spin on that lock until it is a zero (assuming it is already
>>set/held by another thread) looping on the xchngl instruction would _really_
>>interfere with the other processors that are doing useful work.  In comes the
>>shadow lock.
>>
>>If the xchgl instruction shows that the lock is already non-zero, I jump
>>to a loop that tests this value with a movl instruction which loops on the
>>value stored in cache.  When another processor writes back to the lock
>>variable to set it to zero, my cache line for that word gets invalidated and
>>we reload it from memory and see the new zero contents.  While I am looping
>>I don't execute an exchange instruction, just a simple move, which means my
>>processor is not accessing the memory bus at all, letting the other processors
>>run as fast as possible.  When the move finds a zero value, it goes back to the
>>exchange instruction to do the test again atomically.  If it is still zero, we
>>exit the loop, otherwise we hit the shadow spinlock again and spin on cache.
>>
>>This is important because in the above code, the **** instruction used to be
>>"jz" which is wrong.  Because it causes an infinite loop.  If, when the lock
>>code is executed, the lock is zero, the exchange and then test/jz instructions
>>will take me out of the lock as we found a zero and it is now set to a 1.  But
>>if the lock is initially 1, I would hit the shadow spin loop, and I would loop
>>if the lock had been cleared, which is bad, or I would jump back to the exchange
>>if the lock was still 1, which is wrong also.  But the point is the loop would
>>hang if, I entered the code with the lock set to 1 already, and someone cleared
>>it while I was between the exchange and the shadow lock loop.
>>
>>And it did hang, but it actually played complete games on ICC without a
>>problem, and then it would hang in three consecutive games and lose on time.
>>
>>Why is that interesting?  It suggests that for the most part, when this code
>>is called, the lock is zero.  Which is what I had thought all along with just
>>four threads running.  This means that the _locks_ are not really affecting
>>my search speed, contrary to what "some" would like to suggest.  If I were to
>>use 16 processors, I'm sure it would happen more often.  But for now, the locks
>>do not appear to be a performance bottleneck.
>>
>>BTW the above code works fine if you are running linux.  Or using gcc/gas on
>>other platforms.  It is not quite microsoft syntax as MS reverses the
>>operands to dst, src rather than the ATT approach of src, dst.  Also there
>>are other syntactical issues dealing with [] vs () and $constant and so forth.
>>
>>I think I am next going to work on making the other asm all work by inlining
>>it to dump a lot of call instruction overhead that is scattered around...
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.