This article is about a GHC bug I encountered recently, but it's really an excuse to talk about some GHC internals at an intro level. (In turn, an excuse for me to learn about those internals.)
I'll assume you're familiar with the basics of Haskell and lazy evaluation.
The bug
I spoke before of using global locks in Haskell to protect a thread-unsafe C library. Unfortunately, a GHC bug prevents this from working: using unsafePerformIO at the top level of a file can result in IO that happens more than once.
Here is a simple program which illustrates the problem:
import Control.Concurrent
import Control.Monad
import System.IO.Unsafe

ioThunk :: ()
ioThunk = unsafePerformIO $ do
    me <- myThreadId
    putStrLn ("IO executed by " ++ show me)
{-# NOINLINE ioThunk #-}

main :: IO ()
main = do
    replicateM_ 100 (forkIO (print ioThunk))
    threadDelay 10000 -- wait for other threads
Let's test this, following the compiler flag recommendations for unsafePerformIO.
$ ghc -V
The Glorious Glasgow Haskell Compilation System, version 7.2.1
$ ghc -rtsopts -threaded -fno-cse -fno-full-laziness dupe.hs
[1 of 1] Compiling Main ( dupe.hs, dupe.o )
Linking dupe ...
$ while true; do ./dupe +RTS -N | head -n 2; echo ----; done
Within a few seconds I see output like this:
----
IO executed by ThreadId 35
()
----
IO executed by ThreadId 78
IO executed by ThreadId 85
----
IO executed by ThreadId 48
()
----
In the middle run, two threads executed the IO action.
This bug was reported two weeks ago and is already fixed in GHC HEAD. I tested with GHC 7.3.20111026, aka g6f5b798, and the problem seemed to go away.
Unfortunately, it will be some time before GHC 7.4 is widely deployed, so I'm thinking about workarounds for my original global locking problem. I'll probably store the lock in a C global variable via StablePtr, or failing that, implement all locking in C. But I'd appreciate any other suggestions.
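For concreteness, here's a sketch of the fall-back plan, with all locking kept on the C side so that no Haskell-level CAF is involved. This is my own illustration, not code from the bug report; the hs_clib_lock/hs_clib_unlock names and the C file are hypothetical.

{-# LANGUAGE ForeignFunctionInterface #-}

import Control.Exception (bracket_)

-- The C side (a hypothetical clib_lock.c) would hold the mutex in a
-- global, initialized statically:
--
--   static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
--   void hs_clib_lock(void)   { pthread_mutex_lock(&lock);   }
--   void hs_clib_unlock(void) { pthread_mutex_unlock(&lock); }

foreign import ccall safe "hs_clib_lock"   cLock   :: IO ()
foreign import ccall safe "hs_clib_unlock" cUnlock :: IO ()

-- Run an action while holding the C-side lock. "safe" calls let other
-- Haskell threads keep running while this one blocks on the mutex.
withCLib :: IO a -> IO a
withCLib = bracket_ cLock cUnlock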
The remainder of this article is an attempt to explain this GHC bug, and the fix committed by Simon Marlow. It's long because I try not to assume you know anything about how GHC works (I don't know very much, myself), and because there are various digressions.
Objects at runtime
Code produced by GHC can allocate many kinds of objects. Here are just a few:
CONSTR objects represent algebraic data constructors and their associated fields. The value (Just 'x') would be represented by a CONSTR object, holding a pointer to another object representing 'x'.
FUN objects represent functions, like the value (\x -> x+1).
THUNK objects represent computations which have not yet happened. Suppose we write:
let x = 2 + 2 in f x x
This code will construct a THUNK object for x and pass it to the code for f. Some time later, f may force evaluation of its argument, and the thunk will, in turn, invoke (+). When the thunk has finished evaluating, it is overwritten with the evaluation result. (Here, this might be an I# CONSTR holding the number 4.) If f then forces its second argument, which is also x, the work done by (+) is not repeated. This is the essence of lazy evaluation.
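You can watch this sharing happen with a little Debug.Trace instrumentation (a snippet of my own, not from the article): the message fires only the first time x is forced.

import Debug.Trace (trace)

main :: IO ()
main = do
    let x = trace "forcing x" (2 + 2 :: Int)
    -- x is forced twice below, but "forcing x" prints only once:
    -- after the first force, the thunk holds the result 4.
    print (x + x)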
When a thunk is forced, it's first overwritten with a BLACKHOLE object. This BLACKHOLE is eventually replaced with the evaluation result. Therefore a BLACKHOLE represents a thunk which is currently being evaluated. Identifying this case helps the garbage collector, and it also gives GHC its seemingly magical ability to detect some infinite loops. Forcing a BLACKHOLE indicates a computation which cannot proceed until the same computation has finished. The GHC runtime will terminate the program with a <<loop>> exception.
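Here's a tiny example of my own that trips this detection. Compiled with GHC, it typically dies with <<loop>> rather than spinning forever:

main :: IO ()
main = do
    let x = x + 1 :: Int  -- this thunk can only proceed by forcing itself
    print x               -- the runtime reports: <<loop>>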
We can't truly update thunks in place, because the evaluation result might be larger than the space originally allocated for the thunk. So we write an indirection pointing to the evaluation result. These IND objects will later be removed by the garbage collector.
Static objects
Dynamically-allocated objects make sense for values which are created as your program runs. But the top-level declarations in a Haskell module don't need to be dynamically allocated; they already exist when your program starts up. GHC allocates these static objects in your executable's data section, the same place where C global variables live.
Consider this program:
x = Just 'x'
f (Just _) = \y -> y+1
main = print (f x 3)
Ignoring optimizations, GHC will produce code where:
- x is a CONSTR_STATIC object.
- f is a FUN_STATIC object. When called, f will return a dynamically-allocated FUN object representing (\y -> y+1).
- main is a THUNK_STATIC object. It represents the unevaluated expression formed by applying the function print to the argument (f x 3). A static thunk is also known as a constant applicative form, or a CAF for short. Like any other thunk, a CAF may or may not get evaluated. If evaluated, it will be replaced with a black hole and eventually the evaluation result. In this example, main will be evaluated by the runtime system, in deciding what IO to perform. (A short demonstration of CAF behavior follows this list.)
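Here's a quick demonstration of my own that a CAF really is evaluated at most once:

import Debug.Trace (trace)

-- A top-level, argument-free thunk: a CAF.
caf :: Integer
caf = trace "evaluating caf" (sum [1..10000])

main :: IO ()
main = do
    print caf  -- "evaluating caf" prints before the result...
    print caf  -- ...but not again: the CAF now holds the value 50005000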
Black holes and revelations
That's all fine for a single-threaded Haskell runtime, but GHC supports running many Haskell threads across multiple OS threads. This introduces some additional complications. For example, one thread might force a thunk which is currently being evaluated by another thread. The thread will find a BLACKHOLE, but terminating the program would be incorrect. Instead the BLACKHOLE puts the current Haskell thread to sleep, and wakes it up when the evaluation result is ready.
If two threads force the same thunk at the same time, they will both perform the deferred computation. We could avoid this wasted effort by writing and checking for black holes using expensive atomic memory operations. But this is a poor tradeoff; we slow down every evaluation in order to prevent a rare race condition.
As a compiler for a language with pure evaluation, GHC has the luxury of tolerating some duplicated computation. Evaluating an expression twice can't change a program's behavior. And most thunks are cheap to evaluate, hardly worth the effort of avoiding duplication. So GHC follows a "lazy black-holing" strategy. Threads write black holes only when they enter the garbage collector. If a thread discovers that one of its thunks has already been claimed, it will abandon the duplicated work-in-progress. This scheme avoids large wasted computations without paying the price on small computations. You can find the gritty details within the function threadPaused, in rts/ThreadPaused.c.
You may remember that we started, all those many words ago, with a program that uses unsafePerformIO. This breaks the pure-evaluation property of Haskell. Repeated evaluation will affect semantics! Might lazy black-holing be the culprit in the original bug?
Naturally, the GHC developers thought about this case. Here's the implementation of unsafePerformIO:
unsafePerformIO m = unsafeDupablePerformIO (noDuplicate >> m)
noDuplicate = IO $ \s -> case noDuplicate# s of s' -> (# s', () #)
unsafeDupablePerformIO (IO m) = lazy (case m realWorld# of (# _, r #) -> r)
The core behavior is implemented by unsafeDupablePerformIO, using GHC's internal representation of IO actions (which is beyond the scope of this article, to the extent I even have a scope in mind). As the name suggests, unsafeDupablePerformIO provides no guarantee against duplicate execution. The more familiar unsafePerformIO builds this guarantee by first invoking the noDuplicate# primitive operation.
The implementation of noDuplicate#, written in GHC's Cmm intermediate language, handles a few tricky considerations. But it's basically a call to the function threadPaused, which we saw is responsible for lazy black-holing. In other words, thunks built from unsafePerformIO perform eager black-holing.
Since threadPaused has to walk the evaluation stack, unsafeDupablePerformIO might be much faster than unsafePerformIO. In practice, this will matter when performing a great number of very quick IO actions, like peeking a single byte from memory. In this case it is safe to duplicate IO, provided the buffer is unchanging. Let's measure the performance difference.
import GHC.IO
import Foreign hiding (unsafePerformIO)
import System.Random
import Criterion.Main

main = do
    let sz = 1024*1024
    buf <- mallocBytes sz

    let get    i = peekByteOff buf i :: IO Word8
        peek_d i = unsafeDupablePerformIO (get i)
        peek_n i = unsafePerformIO (get i)
        idxes  = take 1024 $ randomRs (0, sz-1) (mkStdGen 49)

    evaluate (sum idxes) -- force idxes ahead of time

    defaultMain
        [ bench "dup"   $ nf (map peek_d) idxes
        , bench "noDup" $ nf (map peek_n) idxes ]
And the results:
$ ghc -rtsopts -threaded -O2 peek.hs && ./peek +RTS -N
...
benchmarking dup
mean: 76.42962 us, lb 75.11134 us, ub 78.18593 us, ci 0.950
std dev: 7.764123 us, lb 6.300310 us, ub 9.790345 us, ci 0.950
benchmarking noDup
mean: 142.1720 us, lb 139.7312 us, ub 145.4300 us, ci 0.950
std dev: 14.43673 us, lb 11.40254 us, ub 17.86663 us, ci 0.950
So performance-critical idempotent actions can benefit from unsafeDupablePerformIO. But most code should use the safer unsafePerformIO, as our bug reproducer does. And the noDuplicate# machinery for unsafePerformIO makes sense, so what's causing our bug?
The bug, finally
After all those details and diversions, let's go back to the fix for GHC bug #5558. The action is mostly in rts/sm/Storage.c. This file is part of GHC's storage manager, which provides services such as garbage collection.
Recall that our problematic code looked like this:
ioThunk :: ()
ioThunk = unsafePerformIO $ do ...
This is an application of the function ($) to two arguments, the first being unsafePerformIO. So it's a static thunk, a CAF. Here's the old description of how CAF evaluation works, from Storage.c:
The entry code for every CAF does the following:

- builds a BLACKHOLE in the heap
- pushes an update frame pointing to the BLACKHOLE
- calls newCaf, below
- updates the CAF with a static indirection to the BLACKHOLE
Why do we build a BLACKHOLE in the heap rather than just updating the thunk directly? It's so that we only need one kind of update frame - otherwise we'd need a static version of the update frame too.
So here's the problem. Normal thunks get blackholed in place, and a thread detects duplicated evaluation by noticing that one of its thunks-in-progress became a BLACKHOLE. But static thunks — CAFs — are blackholed by indirection. Two threads might perform the above procedure concurrently, producing two different heap-allocated BLACKHOLEs, and they'd never notice.
As Simon Marlow put it:
Note [atomic CAF entry]

With THREADED_RTS, newCaf() is required to be atomic (see #5558). This is because if two threads happened to enter the same CAF simultaneously, they would create two distinct CAF_BLACKHOLEs, and so the normal threadPaused() machinery for detecting duplicate evaluation will not detect this. Hence in lockCAF() below, we atomically lock the CAF with WHITEHOLE before updating it with IND_STATIC, and return zero if another thread locked the CAF first. In the event that we lost the race, CAF entry code will re-enter the CAF and block on the other thread's CAF_BLACKHOLE.
I can't explain precisely what a WHITEHOLE means, but they're used for spin locks or wait-free synchronization in various places. For example, the MVar primitives are synchronized by the lockClosure spinlock routine, which uses WHITEHOLEs.
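To make the spin-lock pattern concrete at the source level, here's a toy lock of my own in plain Haskell. It's only an analogy for what the RTS does with WHITEHOLEs, not the actual mechanism; atomicModifyIORef plays the role of the atomic instruction.

import Control.Concurrent (yield)
import Data.IORef

newtype SpinLock = SpinLock (IORef Bool)  -- True = locked

newSpinLock :: IO SpinLock
newSpinLock = fmap SpinLock (newIORef False)

acquire :: SpinLock -> IO ()
acquire s@(SpinLock r) = do
    -- One atomic read-modify-write: set the flag, and learn whether
    -- we were the thread that flipped it from False to True.
    won <- atomicModifyIORef r (\held -> (True, not held))
    if won then return () else yield >> acquire s

release :: SpinLock -> IO ()
release (SpinLock r) = atomicModifyIORef r (\_ -> (False, ()))

As in the RTS, the loser of the race doesn't block in the OS; it simply retries (here after a yield).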
The fix
Here's the corrected CAF evaluation procedure:
The entry code for every CAF does the following:

- builds a CAF_BLACKHOLE in the heap
- calls newCaf, which atomically updates the CAF with IND_STATIC pointing to the CAF_BLACKHOLE
- if newCaf returns zero, it re-enters the CAF (see Note [atomic CAF entry])
- pushes an update frame pointing to the CAF_BLACKHOLE
newCAF is made atomic by introducing a new helper function, lockCAF, which is reproduced here for your viewing pleasure:
STATIC_INLINE StgWord lockCAF (StgClosure *caf, StgClosure *bh)
{
    const StgInfoTable *orig_info;

    orig_info = caf->header.info;

#ifdef THREADED_RTS
    const StgInfoTable *cur_info;

    if (orig_info == &stg_IND_STATIC_info ||
        orig_info == &stg_WHITEHOLE_info) {
        // already claimed by another thread; re-enter the CAF
        return 0;
    }

    cur_info = (const StgInfoTable *)
        cas((StgVolatilePtr)&caf->header.info,
            (StgWord)orig_info,
            (StgWord)&stg_WHITEHOLE_info);

    if (cur_info != orig_info) {
        // already claimed by another thread; re-enter the CAF
        return 0;
    }

    // successfully claimed by us; overwrite with IND_STATIC
#endif

    // For the benefit of revertCAFs(), save the original info pointer
    ((StgIndStatic *)caf)->saved_info = orig_info;

    ((StgIndStatic*)caf)->indirectee = bh;
    write_barrier();
    SET_INFO(caf, &stg_IND_STATIC_info);

    return 1;
}
We grab the CAF's info table pointer, which tells us what kind of object it is. If it's not already claimed by another thread, we write a WHITEHOLE — but only if the CAF hasn't changed in the meantime. This step is an atomic compare-and-swap, implemented by architecture-specific code. The function cas is specified by this pseudocode:
cas(p,o,n) {
    atomically {
        r = *p;
        if (r == o) { *p = n };
        return r;
    }
}
Here's the implementation for x86, using GCC extended inline assembly:
EXTERN_INLINE StgWord
cas(StgVolatilePtr p, StgWord o, StgWord n)
{
__asm__ __volatile__ (
"lock\ncmpxchg %3,%1"
:"=a"(o), "=m" (*(volatile unsigned int *)p)
:"0" (o), "r" (n));
return o;
}
There are some interesting variations between architectures. SPARC and x86 use single instructions, while PowerPC and ARMv6 have longer sequences. Old ARM processors require a global spinlock, which sounds painful. Who's running Haskell on ARMv5 chips?
*deep breath*
Thanks for reading / skimming this far! I learned a lot by writing this article, and I hope you enjoyed reading it. I'm sure I said something wrong somewhere, so please do not hesitate to correct me in the comments.