So we all know (or we should) about the potential perils of double-checked locking. These can be mostly mitigated by judicious use of the
volatile keyword, at least in languages which support it, but in this wonderful world of multi-core and multi-processor development, there are still some problems (PDF) to be aware of (there’s optimizing-compiler theory in there; if that scares you, just take my word that it’s got problems).
The paper linked above spends a fair bit of time going through all the ways an optimizing compiler can trample all over anything you do in code, including most uses of the
volatile keyword. It also goes into things like C++ constructor inlining, which causes problems unlikely to affect Objective-C, since ObjC message sends are never inlined by the compiler, thanks to the language’s dynamism. The second part of the paper, however, covers multiple processors, which is where just about anything we do, short of using an expensive lock, can potentially be trampled upon every single time.
I’ll take a step back here to explain the problem domain, what’s going wrong, and how it could be fixed. Then I’ll show you how to do the same thing without using any locking at all. That’s usually a win.
Our problem is that we have a class method returning a lazily-initialized variable. Perhaps it’s a formatted string which can’t be typed at compile time. Perhaps it’s just something memory-intensive to create (or just to have resident) and we want to avoid going through all the necessary steps until we actually need it. This is why it’s called lazy initialization. So, we have some code like this:
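The original code sample isn’t preserved in this copy, but the pattern it described can be sketched in plain C (a simple struct and the names shared_object and payload are placeholders here; the post is about an Objective-C class method):

```c
#include <stdlib.h>

/* Placeholder stand-in for the AnObject class. */
typedef struct AnObject { int payload; } AnObject;

static AnObject *__theObject = NULL;

/* Lazy initialization: only pay the creation cost on first use. */
AnObject *shared_object(void) {
    if (__theObject == NULL) {
        /* imagine this allocation/setup is expensive */
        AnObject *obj = malloc(sizeof *obj);
        obj->payload = 42;
        __theObject = obj;
    }
    return __theObject;
}
```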
That’s all fine until we hit some multi-threaded code. It’s quite possible that two threads will arrive in this function at the same time, and one will be suspended after checking
__theObject but before assigning it. The other thread then checks the value and creates/assigns its own instance. Here we have a memory leak: two instances of
AnObject were allocated, but one assignment overwrote the other, leaving an object with no remaining reference.
The initial solution is pretty simple: wrap the whole thing, including the check, in
@synchronized(self). This, however, is not very efficient. Apart from the fact that
@synchronized is the slowest form of locking available to you (
OSSpinLock sits at the fast end of that scale), you’re now causing every single invocation of this method to acquire a high-level lock, even though the lock is only needed in one case: the very first call. This means that all your accesses are going to be needlessly slowed down.
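For comparison, here is a C sketch of that fully locked version, with a pthread mutex standing in for @synchronized(self) (the names, as before, are placeholders):

```c
#include <stdlib.h>
#include <pthread.h>

typedef struct AnObject { int payload; } AnObject;

static AnObject *__theObject = NULL;
static pthread_mutex_t __theLock = PTHREAD_MUTEX_INITIALIZER;

/* Correct, but every single call pays for the lock,
   even long after the object has been created. */
AnObject *shared_object_locked(void) {
    pthread_mutex_lock(&__theLock);
    if (__theObject == NULL) {
        __theObject = malloc(sizeof *__theObject);
        __theObject->payload = 42;
    }
    pthread_mutex_unlock(&__theLock);
    return __theObject;
}
```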
This is where double-checked locking comes in: In essence, you check the value of
__theObject right away. If it’s
nil then you acquire the lock, check again, then allocate/assign if necessary. The problem this time comes from CPU caches and multiple cores: each core can have its own cached copy of the code and memory, so one core’s write may not yet be visible to another core’s check. It also means that you’re still acquiring a lock and potentially going into contention over that lock. It’d be much nicer if we had a pure lockless algorithm to rely on, wouldn’t it? With that in mind, we’ll look at the problem again, to see what’s actually at the root: what is the exact behaviour we’re trying to prevent here?
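The double-checked shape looks like this as a C sketch, again with a pthread mutex as the lock. (Note that this naive form is exactly the pattern the paper shows can misbehave on multiple cores without memory barriers; it’s shown here only to make the structure concrete.)

```c
#include <stdlib.h>
#include <pthread.h>

typedef struct AnObject { int payload; } AnObject;

static AnObject *__theObject = NULL;
static pthread_mutex_t __theLock = PTHREAD_MUTEX_INITIALIZER;

AnObject *shared_object_dcl(void) {
    if (__theObject == NULL) {            /* first check: no lock taken */
        pthread_mutex_lock(&__theLock);
        if (__theObject == NULL) {        /* second check: under the lock */
            AnObject *obj = malloc(sizeof *obj);
            obj->payload = 42;
            __theObject = obj;
        }
        pthread_mutex_unlock(&__theLock);
    }
    return __theObject;
}
```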
The issue is with two threads creating their own instances of
AnObject, and one of those going away. The compiler will do its best to defeat any naive sequence of ‘check, allocate, check again, assign’, as will concurrency itself (your thread could be preempted between ‘check again’ and ‘assign’, during which another thread could do its own assign). And yet that’s pretty much exactly what we need to do: assign a new value only if the current value is nil. Why didn’t anyone think of this when designing these wonderfully complicated systems?
The answer, of course, is that they did. The CPUs we’re interested in (PowerPC, IA-32, IA-64, x86-64, and ARM) all support what is called a ‘compare and exchange’ (or compare-and-swap) operation in hardware; on Intel it’s a single instruction, while PowerPC and ARM build it from a load-linked/store-conditional pair. The basic semantics go like this:
- Get the current value from memory.
- If it’s not what we’re expecting: leave, return 0.
- If it is what we’re expecting: assign new value.
- If something else wrote to the memory behind our back: leave, return 0.
- Otherwise, the new value went in without anything else getting in the way: return 1.
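Those semantics can be expressed directly with C11’s stdatomic, where atomic_compare_exchange_strong plays the part of the CPU instruction (the wrapper name here is mine, not an OS API):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Returns true if *target held `expected` and now holds `desired`;
   returns false, storing nothing, if another writer got there first. */
bool compare_and_swap(_Atomic(void *) *target,
                      void *expected, void *desired) {
    return atomic_compare_exchange_strong(target, &expected, desired);
}
```

A second attempt with the same expected value of NULL then fails, which corresponds to the ‘return 0’ cases in the list above.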
That’s pretty much exactly what we’re looking for, isn’t it? But how can we tell whether something else changed the memory? Well, that’s something else the CPU can handle. After all, a significant amount of the silicon in a CPU is designed to move data into or out of main memory. This means that the CPU can put a ‘lock’ or a ‘modification flag’ or some similar conceptual construct onto the bit of memory it’s dealing with. If something changes that memory, the lock/flag will be set, thus alerting the CPU to the fact that the memory was altered outside of its own flow of execution. As for the assign part, these CPUs all have conditional operations which will only perform an operation if certain flag bits are in a given state: therefore there’s an instruction which won’t actually perform the store if the lock/flag is set.
On Mac OS X and the iPhone, these algorithms are implemented (in assembler code, naturally) by the
OSAtomicCompareAndSwap series of functions. The one we’ll use to create our optimized lazy-setter is
OSAtomicCompareAndSwapPtrBarrier(). This works on pointer-sized variables (so is the same on 32- or 64-bit architectures) and includes a memory barrier (sometimes called a gate or a fence) to ensure that the CPU doesn’t reorder any of the reads and writes around inside the compare & swap function itself.
With this function in hand, we can now perform an atomic test-and-assign operation which either stores the new object into the global pointer, or else tells us that we should release our surplus object. All with no locks, multithreading-safe, and multiprocessor-safe:
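The final listing is missing from this copy; here is a C11 sketch of the same lockless pattern, with atomic_compare_exchange_strong standing in for OSAtomicCompareAndSwapPtrBarrier, and free() standing in for releasing the surplus Objective-C object:

```c
#include <stdatomic.h>
#include <stdlib.h>

typedef struct AnObject { int payload; } AnObject;

static _Atomic(AnObject *) __theObject = NULL;

AnObject *shared_object_lockless(void) {
    AnObject *existing = atomic_load(&__theObject);
    if (existing == NULL) {
        /* Build our candidate object optimistically, outside any lock. */
        AnObject *fresh = malloc(sizeof *fresh);
        fresh->payload = 42;

        AnObject *expected = NULL;
        if (atomic_compare_exchange_strong(&__theObject, &expected, fresh)) {
            existing = fresh;     /* we won the race: our object went in */
        } else {
            free(fresh);          /* we lost: discard the surplus object */
            existing = expected;  /* ...and hand back the winner's */
        }
    }
    return existing;
}
```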