Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: GCC inline asm. Interesting notes, particularly about atomic locks.

Author: Dezhi Zhao

Date: 13:30:20 12/20/02

Go up one level in this thread


On December 19, 2002 at 11:40:11, Robert Hyatt wrote:

>For those remembering the discussion from a couple of weeks ago, I
>had run into a strange problem with getting an inline asm lock to
>work.  I was playing with this because I was following the Intel
>guideline of adding a "pause" to the "shadow-lock" part of the code.
>
>First, the inline asm code:
>
>void static __inline__ LockX86(volatile int * lock) {
>        int dummy;
>        asm __volatile__ (
>            "1:          movl    $1, %0"           "\n\t"
>            "            xchgl   (%1), %0"         "\n\t"
>            "            testl   %0, %0"           "\n\t"
>            "            jz      3f"               "\n\t"
>            "2:          pause"                    "\n\t"
>            "            movl    (%1), %0"         "\n\t"
>            "            testl   %0, %0"           "\n\t"
> ****       "            jnz     2b"               "\n\t"
>            "            jmp     1b"               "\n\t"
>            "3:"                                   "\n\t"
>            : "=&q" (dummy)
>            : "q" (lock)
>            : "cc");
>}
>

Can we safely delete xchgl (%1), %0 instruction here? There is a similar example
in the Intel spin lock application note. However the Intel example goes without
an xchg instruction.

>First, the lock now works.  The bug was on the **** line above.  I
>had incorrectly written "jz".  To explain the code first, read on...
>
>This is based on the "shadow lock" approach to avoid frying the bus
>when a processor is spinning.  The _real_ lock must always be set/tested
>with an atomic-type instruction, and the xchgl (xchange long) instruction
>does this in an indivisable way.  Unfortunately, it bypasses a cache hit
>and runs out to memory to actually grab the old value and replace it with a
>new value while the bus is locked.
>
>Since I need to spin on that lock until it is a zero (assuming it is already
>set/held by another thread) looping on the xchngl instruction would _really_
>interfere with the other processors that are doing useful work.  In comes the
>shadow lock.
>
>If the xchgl instruction shows that the lock is already non-zero, I jump
>to a loop that tests this value with a movl instruction which loops on the
>value stored in cache.  When another processor writes back to the lock
>variable to set it to zero, my cache line for that word gets invalidated and
>we reload it from memory and see the new zero contents.  While I am looping
>I don't execute an exchange instruction, just a simple move, which means my
>processor is not accessing the memory bus at all, letting the other processors
>run as fast as possible.  When the move finds a zero value, it goes back to the
>exchange instruction to do the test again atomically.  If it is still zero, we
>exit the loop, otherwise we hit the shadow spinlock again and spin on cache.
>
>This is important because in the above code, the **** instruction used to be
>"jz" which is wrong.  Because it causes an infinite loop.  If, when the lock
>code is executed, the lock is zero, the exchange and then test/jz instructions
>will take me out of the lock as we found a zero and it is now set to a 1.  But
>if the lock is initially 1, I would hit the shadow spin loop, and I would loop
>if the lock had been cleared, which is bad, or I would jump back to the exchange
>if the lock was still 1, which is wrong also.  But the point is the loop would
>hang if, I entered the code with the lock set to 1 already, and someone cleared
>it while I was between the exchange and the shadow lock loop.
>
>And it did hang, but it actually played complete games on ICC without a
>problem, and then it would hang in three consecutive games and lose on time.
>
>Why is that interesting?  It suggests that for the most part, when this code
>is called, the lock is zero.  Which is what I had thought all along with just
>four threads running.  This means that the _locks_ are not really affecting
>my search speed, contrary to what "some" would like to suggest.  If I were to
>use 16 processors, I'm sure it would happen more often.  But for now, the locks
>do not appear to be a performance bottleneck.
>
>BTW the above code works fine if you are running linux.  Or using gcc/gas on
>other platforms.  It is not quite microsoft syntax as MS reverses the
>operands to dst, src rather than the ATT approach of src, dst.  Also there
>are other syntactical issues dealing with [] vs () and $constant and so forth.
>
>I think I am next going to work on making the other asm all work by inlining
>it to dump a lot of call instruction overhead that is scattered around...



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.