This article is about a GHC bug I encountered recently, but it's really an excuse to talk about some GHC internals at an intro level. (In turn, an excuse for me to learn about those internals.)
I'll assume you're familiar with the basics of Haskell and lazy evaluation.
The bug
I spoke before of using global locks in Haskell to protect a thread-unsafe C library. Unfortunately a GHC bug prevents this from working. Using unsafePerformIO at the top level of a file can result in IO that happens more than once.
Here is a simple program which illustrates the problem:
import Control.Concurrent
import Control.Monad
import System.IO.Unsafe
ioThunk :: ()
ioThunk = unsafePerformIO $ do
    me <- myThreadId
    putStrLn ("IO executed by " ++ show me)
{-# NOINLINE ioThunk #-}
main :: IO ()
main = do
    replicateM_ 100 (forkIO (print ioThunk))
    threadDelay 10000  -- wait for other threads
Let's test this, following the compiler flag recommendations for unsafePerformIO.
$ ghc -V
The Glorious Glasgow Haskell Compilation System, version 7.2.1
$ ghc -rtsopts -threaded -fno-cse -fno-full-laziness dupe.hs
[1 of 1] Compiling Main ( dupe.hs, dupe.o )
Linking dupe ...
$ while true; do ./dupe +RTS -N | head -n 2; echo ----; done
Within a few seconds I see output like this:
----
IO executed by ThreadId 35
()
----
IO executed by ThreadId 78
IO executed by ThreadId 85
----
IO executed by ThreadId 48
()
----
In the middle run, two threads executed the IO action.
This bug was reported two weeks ago and is already fixed in GHC HEAD. I tested with GHC 7.3.20111026, aka g6f5b798, and the problem seemed to go away.
Unfortunately it will be some time before GHC 7.4 is widely deployed, so I'm thinking about workarounds for my original global locking problem. I'll probably store the lock in a C global variable via StablePtr, or failing that, implement all locking in C. But I'd appreciate any other suggestions.
The remainder of this article is an attempt to explain this GHC bug, and the fix committed by Simon Marlow. It's long because

- I try not to assume you know anything about how GHC works. I don't know very much, myself.
- There are various digressions.
Objects at runtime
Code produced by GHC can allocate many kinds of objects. Here are just a few:
CONSTR objects represent algebraic data constructors and their associated fields. The value (Just 'x') would be represented by a CONSTR object, holding a pointer to another object representing 'x'.

FUN objects represent functions, like the value (\x -> x+1).

THUNK objects represent computations which have not yet happened. Suppose we write:

    let x = 2 + 2 in f x x

This code will construct a THUNK object for x and pass it to the code for f. Some time later, f may force evaluation of its argument, and the thunk will, in turn, invoke (+). When the thunk has finished evaluating, it is overwritten with the evaluation result. (Here, this might be an I# CONSTR holding the number 4.) If f then forces its second argument, which is also x, the work done by (+) is not repeated. This is the essence of lazy evaluation.

When a thunk is forced, it's first overwritten with a BLACKHOLE object. This BLACKHOLE is eventually replaced with the evaluation result. Therefore a BLACKHOLE represents a thunk which is currently being evaluated. Identifying this case helps the garbage collector, and it also gives GHC its seemingly magical ability to detect some infinite loops: forcing a BLACKHOLE indicates a computation which cannot proceed until the same computation has finished. The GHC runtime will terminate the program with a <<loop>> exception.

We can't truly update thunks in place, because the evaluation result might be larger than the space originally allocated for the thunk. So we write an indirection pointing to the evaluation result. These IND objects will later be removed by the garbage collector.
Static objects
Dynamically-allocated objects make sense for values which are created as your program runs. But the top-level declarations in a Haskell module don't need to be dynamically allocated; they already exist when your program starts up. GHC allocates these static objects in your executable's data section, the same place where C global variables live.
Consider this program:
x = Just 'x'
f (Just _) = \y -> y+1
main = print (f x 3)
Ignoring optimizations, GHC will produce code where:
x is a CONSTR_STATIC object.

f is a FUN_STATIC object. When called, f will return a dynamically-allocated FUN object representing (\y -> y+1).

main is a THUNK_STATIC object. It represents the unevaluated expression formed by applying the function print to the argument (f x 3). A static thunk is also known as a constant applicative form, or a CAF for short. Like any other thunk, a CAF may or may not get evaluated. If evaluated, it will be replaced with a black hole and eventually the evaluation result. In this example, main will be evaluated by the runtime system, in deciding what IO to perform.
Black holes and revelations
That's all fine for a single-threaded Haskell runtime, but GHC supports running many Haskell threads across multiple OS threads. This introduces some additional complications. For example, one thread might force a thunk which is currently being evaluated by another thread. The thread will find a BLACKHOLE, but terminating the program would be incorrect. Instead the BLACKHOLE puts the current Haskell thread to sleep, and wakes it up when the evaluation result is ready.
If two threads force the same thunk at the same time, they will both perform the deferred computation. We could avoid this wasted effort by writing and checking for black holes using expensive atomic memory operations. But this is a poor tradeoff; we slow down every evaluation in order to prevent a rare race condition.
As a compiler for a language with pure evaluation, GHC has the luxury of tolerating some duplicated computation. Evaluating an expression twice can't change a program's behavior. And most thunks are cheap to evaluate, hardly worth the effort of avoiding duplication. So GHC follows a "lazy black-holing" strategy [1] [2]. Threads write black holes only when they enter the garbage collector. If a thread discovers that one of its thunks has already been claimed, it will abandon the duplicated work-in-progress. This scheme avoids large wasted computations without paying the price on small computations. You can find the gritty details within the function threadPaused, in rts/ThreadPaused.c.
unsafe[Dupable]PerformIO
You may remember that we started, all those many words ago, with a program that uses unsafePerformIO. This breaks the pure-evaluation property of Haskell. Repeated evaluation will affect semantics! Might lazy black-holing be the culprit in the original bug?
Naturally, the GHC developers thought about this case. Here's the implementation of unsafePerformIO:
unsafePerformIO m = unsafeDupablePerformIO (noDuplicate >> m)
noDuplicate = IO $ \s -> case noDuplicate# s of s' -> (# s', () #)
unsafeDupablePerformIO (IO m) = lazy (case m realWorld# of (# _, r #) -> r)
The core behavior is implemented by unsafeDupablePerformIO, using GHC's internal representation of IO actions (which is beyond the scope of this article, to the extent I even have a scope in mind). As the name suggests, unsafeDupablePerformIO provides no guarantee against duplicate execution. The more familiar unsafePerformIO builds this guarantee by first invoking the noDuplicate# primitive operation.
The implementation of noDuplicate#, written in GHC's Cmm intermediate language, handles a few tricky considerations. But it's basically a call to the function threadPaused, which we saw is responsible for lazy black-holing. In other words, thunks built from unsafePerformIO perform eager black-holing.
Since threadPaused has to walk the evaluation stack, unsafeDupablePerformIO might be much faster than unsafePerformIO. In practice, this will matter when performing a great number of very quick IO actions, like peeking a single byte from memory. In this case it is safe to duplicate IO, provided the buffer is unchanging. Let's measure the performance difference.
import GHC.IO
import Foreign hiding (unsafePerformIO)
import System.Random
import Criterion.Main
main = do
    let sz = 1024*1024
    buf <- mallocBytes sz

    let get    i = peekByteOff buf i :: IO Word8
        peek_d i = unsafeDupablePerformIO (get i)
        peek_n i = unsafePerformIO        (get i)

        idxes = take 1024 $ randomRs (0, sz-1) (mkStdGen 49)
    evaluate (sum idxes)  -- force idxes ahead of time

    defaultMain
        [ bench "dup"   $ nf (map peek_d) idxes
        , bench "noDup" $ nf (map peek_n) idxes ]
And the results:
$ ghc -rtsopts -threaded -O2 peek.hs && ./peek +RTS -N
...
benchmarking dup
mean: 76.42962 us, lb 75.11134 us, ub 78.18593 us, ci 0.950
std dev: 7.764123 us, lb 6.300310 us, ub 9.790345 us, ci 0.950
benchmarking noDup
mean: 142.1720 us, lb 139.7312 us, ub 145.4300 us, ci 0.950
std dev: 14.43673 us, lb 11.40254 us, ub 17.86663 us, ci 0.950
So performance-critical idempotent actions can benefit from unsafeDupablePerformIO. But most code should use the safer unsafePerformIO, as our bug reproducer does. And the noDuplicate# machinery for unsafePerformIO makes sense, so what's causing our bug?
The bug, finally
After all those details and diversions, let's go back to the fix for GHC bug #5558. The action is mostly in rts/sm/Storage.c. This file is part of GHC's storage manager, which provides services such as garbage collection.
Recall that our problematic code looked like this:
ioThunk :: ()
ioThunk = unsafePerformIO $ do ...
This is an application of the function ($) to two arguments: unsafePerformIO and the do block. So it's a static thunk, a CAF. Here's the old description of how CAF evaluation works, from Storage.c:
The entry code for every CAF does the following:
- builds a BLACKHOLE in the heap
- pushes an update frame pointing to the BLACKHOLE
- calls newCaf, below
- updates the CAF with a static indirection to the BLACKHOLE

Why do we build a BLACKHOLE in the heap rather than just updating the thunk directly? It's so that we only need one kind of update frame - otherwise we'd need a static version of the update frame too.
So here's the problem. Normal thunks get blackholed in place, and a thread detects duplicated evaluation by noticing that one of its thunks-in-progress became a BLACKHOLE. But static thunks — CAFs — are blackholed by indirection. Two threads might perform the above procedure concurrently, producing two different heap-allocated BLACKHOLEs, and they'd never notice.
As Simon Marlow put it:
Note [atomic CAF entry]
With THREADED_RTS, newCaf() is required to be atomic (see #5558). This is because if two threads happened to enter the same CAF simultaneously, they would create two distinct CAF_BLACKHOLEs, and so the normal threadPaused() machinery for detecting duplicate evaluation will not detect this. Hence in lockCAF() below, we atomically lock the CAF with WHITEHOLE before updating it with IND_STATIC, and return zero if another thread locked the CAF first. In the event that we lost the race, CAF entry code will re-enter the CAF and block on the other thread's CAF_BLACKHOLE.
I can't explain precisely what a WHITEHOLE means, but they're used for spin locks or wait-free synchronization in various places. For example, the MVar primitives are synchronized by the lockClosure spinlock routine, which uses WHITEHOLEs.
The fix
Here's the corrected CAF evaluation procedure:
The entry code for every CAF does the following:
- builds a CAF_BLACKHOLE in the heap
- calls newCaf, which atomically updates the CAF with IND_STATIC pointing to the CAF_BLACKHOLE
- if newCaf returns zero, it re-enters the CAF (see Note [atomic CAF entry])
- pushes an update frame pointing to the CAF_BLACKHOLE
newCaf is made atomic by introducing a new helper function, lockCAF, which is reproduced here for your viewing pleasure:
STATIC_INLINE StgWord lockCAF (StgClosure *caf, StgClosure *bh)
{
    const StgInfoTable *orig_info;

    orig_info = caf->header.info;

#ifdef THREADED_RTS
    const StgInfoTable *cur_info;

    if (orig_info == &stg_IND_STATIC_info ||
        orig_info == &stg_WHITEHOLE_info) {
        // already claimed by another thread; re-enter the CAF
        return 0;
    }

    cur_info = (const StgInfoTable *)
        cas((StgVolatilePtr)&caf->header.info,
            (StgWord)orig_info,
            (StgWord)&stg_WHITEHOLE_info);

    if (cur_info != orig_info) {
        // already claimed by another thread; re-enter the CAF
        return 0;
    }

    // successfully claimed by us; overwrite with IND_STATIC
#endif

    // For the benefit of revertCAFs(), save the original info pointer
    ((StgIndStatic *)caf)->saved_info = orig_info;

    ((StgIndStatic *)caf)->indirectee = bh;
    write_barrier();
    SET_INFO(caf, &stg_IND_STATIC_info);

    return 1;
}
We grab the CAF's info table pointer, which tells us what kind of object it is. If it's not already claimed by another thread, we write a WHITEHOLE — but only if the CAF hasn't changed in the meantime. This step is an atomic compare-and-swap, implemented by architecture-specific code. The function cas is specified by this pseudocode:
cas(p,o,n) {
    atomically {
        r = *p;
        if (r == o) { *p = n };
        return r;
    }
}
Here's the implementation for x86, using GCC extended inline assembly:
EXTERN_INLINE StgWord
cas(StgVolatilePtr p, StgWord o, StgWord n)
{
    __asm__ __volatile__ (
        "lock\ncmpxchg %3,%1"
        :"=a"(o), "=m" (*(volatile unsigned int *)p)
        :"0" (o), "r" (n));
    return o;
}
There are some interesting variations between architectures. SPARC and x86 use single instructions, while PowerPC and ARMv6 have longer sequences. Old ARM processors require a global spinlock, which sounds painful. Who's running Haskell on ARMv5 chips?
*deep breath*
Thanks for reading / skimming this far! I learned a lot by writing this article, and I hope you enjoyed reading it. I'm sure I said something wrong somewhere, so please do not hesitate to correct me in the comments.
[1] Tim Harris, Simon Marlow, and Simon Peyton Jones. Haskell on a shared-memory multiprocessor. In Haskell '05: Proceedings of the 2005 ACM SIGPLAN workshop on Haskell, pages 49–61.
[2] Simon Marlow, Simon Peyton Jones, and Satnam Singh. Runtime Support for Multicore Haskell. In ICFP '09.