Lock-free programming is difficult; my advice to anyone writing C is to avoid it if at all possible. Writing lock-free code requires a strong understanding of compiler and processor behavior, which frankly most people don’t have, and if you aren’t careful it can produce extremely subtle bugs and reduce portability across architectures. Unless you have profiled and know you need lock-free programming, you probably don’t.

x86 vs ARM

The most common architecture transition nowadays is taking code written for x86 and making sure it runs correctly on ARM, brought about by Apple’s new M1 chip. ARM has a more relaxed memory model than x86, and so code that is benign on x86 may be broken on ARM.

Bruce Dawson has a brief article explaining the difference. If you aren’t familiar, I recommend giving this a read.

x86 and ARM/PowerPC differences are, in my experience, pretty well documented. Lock-free programming is still extremely difficult, but if you want to make sure your code is correct on both architectures there are plenty of resources to help you out.

ARM vs Alpha

A lot of documentation on lock-free programming seems to explicitly mention Alpha, but only to immediately lump it in with ARM/PowerPC. This is a mistake—Alpha is special.

Alpha

For those unfamiliar, Alpha is an older RISC architecture designed by Digital Equipment Corporation. It was first released in 1992, and in 2001 the IP was sold to Intel, effectively killing it off. It was extremely powerful compared to its competition at Intel and HP at the time, and innovations stemming from it influenced other processors in the years to come.

Of relevance to this piece, however, is Alpha’s exceedingly lax memory model. Unlike ARM, which will respect data dependencies for reordering, Alpha can reorder reads regardless. This is really hard to reason about, even for people comfortable with lock-free programming.

Simple case

The particular situation that prompted this piece, seen in real code, is lock-free singleton initialization. We’ll start with a simpler case, a static int, that avoids abstracting some of the details:

int get_foo() {
    static int *gfoo;

    if (gfoo)
        return *gfoo;

    int *foo = malloc();
    *foo = rand();

    FullBarrier(); // just to be explicit
    if (InterlockedCompareExchange(&gfoo, foo, NULL))
        free(foo);

    return *gfoo;
}

This example will work fine on x86/ARM, but is busted on Alpha. It doesn’t guarantee any ordering between the read of gfoo and the read of *gfoo later on as the full barrier is not on the fast path, so it’s possible for a caller to read uninitialized memory.

This is hard to follow, so here’s an example of how the reordering could happen on Alpha to produce that result:

Writer Thread Reader Thread
foo = malloc() returns address 0x1000, containing the value 0xDEADBEEF  
  Read *gfoo. The value, which we will get in the fifth row, will be 0x1000, so this read returns *0x1000 == 0xDEADBEEF
*foo = rand() sets *0x1000 to 7  
InterlockedCompareExchange sets gfoo to 0x1000  
  Read gfoo == 0x1000
  Return 0xDEADBEEF

Isn’t that cool? This is basically indirecting through a pointer before reading the pointer value itself! I mean, it’s also batshit insane, but hey, still cool. In the words of the person who explained this to me:

“Time is not a straight line. Alpha is exceedingly hard to reason about.”

The fix is to follow the usual principle that barriers must come in pairs, making the read barrier explicit:

int get_foo() {
    static int *gfoo;

    if (gfoo) {
        ReadBarrier();
        return *gfoo;
    }

    int *foo = malloc();
    *foo = rand();

    FullBarrier();
    if (InterlockedCompareExchange(&gfoo, foo, NULL))
        free(foo);

    return *gfoo;
}

Realistic case

Below I’ve included a conceptually similar example that more closely resembles code you might see in the wild. You can assume some_initializer() is calling malloc and initializing a struct, and that global_pointer is predefined and immutable after initialization.

if (global_pointer)
    return global_pointer;

void *local_pointer = some_initializer();

if (InterlockedCompareExchange(&global_pointer, local_pointer, NULL))
    some_cleanup_function(local_pointer);

return global_pointer;

This is similarly broken, and the fix is the same: adding the explicit read barrier after the load for global_pointer.

I also want to note that locking around the initializer does not help. If some_initializer() cannot run concurrently, you might be tempted to write something like the following:

if (global_pointer)
    return global_pointer;

acquire_global_lock();

if (!global_pointer) {
    local_pointer = some_initializer();
    full_barrier();
    global_pointer = local_pointer;
}

release_global_lock();

return global_pointer;

This exhibits the exact same issue. It is referred to as “double-checked locking” and considered an anti-pattern.

Fin

I might one day go back and edit in a section on options for dealing with this in C++, but otherwise that’s it! Hopefully you learned something about Alpha, and if you have any further questions feel free to shoot me an email.