Monday, October 31, 2011

Thunks and lazy blackholes: an introduction to GHC at runtime

This article is about a GHC bug I encountered recently, but it's really an excuse to talk about some GHC internals at an intro level. (In turn, an excuse for me to learn about those internals.)

I'll assume you're familiar with the basics of Haskell and lazy evaluation.

The bug

I spoke before of using global locks in Haskell to protect a thread-unsafe C library. Unfortunately a GHC bug prevents this from working. Using unsafePerformIO at the top level of a file can result in IO that happens more than once.

Here is a simple program which illustrates the problem:

import Control.Concurrent
import Control.Monad
import System.IO.Unsafe

ioThunk :: ()
ioThunk = unsafePerformIO $ do
me <- myThreadId
putStrLn ("IO executed by " ++ show me)
{-# NOINLINE ioThunk #-}

main :: IO ()
main = do
replicateM_ 100 (forkIO (print ioThunk))
threadDelay 10000 -- wait for other threads

Let's test this, following the compiler flag recommendations for unsafePerformIO.

$ ghc -V
The Glorious Glasgow Haskell Compilation System, version 7.2.1

$ ghc -rtsopts -threaded -fno-cse -fno-full-laziness dupe.hs
[1 of 1] Compiling Main             ( dupe.hs, dupe.o )
Linking dupe ...

$ while true; do ./dupe +RTS -N | head -n 2; echo ----; done

Within a few seconds I see output like this:

----
IO executed by ThreadId 35
()
----
IO executed by ThreadId 78
IO executed by ThreadId 85
----
IO executed by ThreadId 48
()
----

In the middle run, two threads executed the IO action.

This bug was reported two weeks ago and is already fixed in GHC HEAD. I tested with GHC 7.3.20111026, aka g6f5b798, and the problem seemed to go away.

Unfortunately it will be some time before GHC 7.4 is widely deployed, so I'm thinking about workarounds for my original global locking problem. I'll probably store the lock in a C global variable via StablePtr, or failing that, implement all locking in C. But I'd appreciate any other suggestions.

The remainder of this article is an attempt to explain this GHC bug, and the fix committed by Simon Marlow. It's long because

  • I try not to assume you know anything about how GHC works. I don't know very much, myself.

  • There are various digressions.

Objects at runtime

Code produced by GHC can allocate many kinds of objects. Here are just a few:

  • CONSTR objects represent algebraic data constructors and their associated fields. The value (Just 'x') would be represented by a CONSTR object, holding a pointer to another object representing 'x'.

  • FUN objects represent functions, like the value (\x -> x+1).

  • THUNK objects represent computations which have not yet happened. Suppose we write:

    let x = 2 + 2 in f x x

    This code will construct a THUNK object for x and pass it to the code for f. Some time later, f may force evaluation of its argument, and the thunk will, in turn, invoke (+). When the thunk has finished evaluating, it is overwritten with the evaluation result. (Here, this might be an I# CONSTR holding the number 4.) If f then forces its second argument, which is also x, the work done by (+) is not repeated. This is the essence of lazy evaluation.

  • When a thunk is forced, it's first overwritten with a BLACKHOLE object. This BLACKHOLE is eventually replaced with the evaluation result. Therefore a BLACKHOLE represents a thunk which is currently being evaluated.

    Identifying this case helps the garbage collector, and it also gives GHC its seemingly magical ability to detect some infinite loops. Forcing a BLACKHOLE indicates a computation which cannot proceed until the same computation has finished. The GHC runtime will terminate the program with a <<loop>> exception.

  • We can't truly update thunks in place, because the evaluation result might be larger than the space originally allocated for the thunk. So we write an indirection pointing to the evaluation result. These IND objects will later be removed by the garbage collector.

Static objects

Dynamically-allocated objects make sense for values which are created as your program runs. But the top-level declarations in a Haskell module don't need to be dynamically allocated; they already exist when your program starts up. GHC allocates these static objects in your executable's data section, the same place where C global variables live.

Consider this program:

x = Just 'x'

f (Just _) = \y -> y+1

main = print (f x 3)

Ignoring optimizations, GHC will produce code where:

  • x is a CONSTR_STATIC object.

  • f is a FUN_STATIC object. When called, f will return a dynamically-allocated FUN object representing (\y -> y+1).

  • main is a THUNK_STATIC object. It represents the unevaluated expression formed by applying the function print to the argument (f x 3). A static thunk is also known as a constant applicative form, or a CAF for short. Like any other thunk, a CAF may or may not get evaluated. If evaluated, it will be replaced with a black hole and eventually the evaluation result. In this example, main will be evaluated by the runtime system, in deciding what IO to perform.

Black holes and revelations

That's all fine for a single-threaded Haskell runtime, but GHC supports running many Haskell threads across multiple OS threads. This introduces some additional complications. For example, one thread might force a thunk which is currently being evaluated by another thread. The thread will find a BLACKHOLE, but terminating the program would be incorrect. Instead the BLACKHOLE puts the current Haskell thread to sleep, and wakes it up when the evaluation result is ready.

If two threads force the same thunk at the same time, they will both perform the deferred computation. We could avoid this wasted effort by writing and checking for black holes using expensive atomic memory operations. But this is a poor tradeoff; we slow down every evaluation in order to prevent a rare race condition.

As a compiler for a language with pure evaluation, GHC has the luxury of tolerating some duplicated computation. Evaluating an expression twice can't change a program's behavior. And most thunks are cheap to evaluate, hardly worth the effort of avoiding duplication. So GHC follows a "lazy black-holing" strategy.12 Threads write black holes only when they enter the garbage collector. If a thread discovers that one of its thunks has already been claimed, it will abandon the duplicated work-in-progress. This scheme avoids large wasted computations without paying the price on small computations. You can find the gritty details within the function threadPaused, in rts/ThreadPaused.c.

unsafe[Dupable]PerformIO

You may remember that we started, all those many words ago, with a program that uses unsafePerformIO. This breaks the pure-evaluation property of Haskell. Repeated evaluation will affect semantics! Might lazy black-holing be the culprit in the original bug?

Naturally, the GHC developers thought about this case. Here's the implementation of unsafePerformIO:

unsafePerformIO m = unsafeDupablePerformIO (noDuplicate >> m)

noDuplicate = IO $ \s -> case noDuplicate# s of s' -> (# s', () #)

unsafeDupablePerformIO (IO m) = lazy (case m realWorld# of (# _, r #) -> r)

The core behavior is implemented by unsafeDupablePerformIO, using GHC's internal representation of IO actions (which is beyond the scope of this article, to the extent I even have a scope in mind). As the name suggests, unsafeDupablePerformIO provides no guarantee against duplicate execution. The more familiar unsafePerformIO builds this guarantee by first invoking the noDuplicate# primitive operation.

The implementation of noDuplicate#, written in GHC's Cmm intermediate language, handles a few tricky considerations. But it's basically a call to the function threadPaused, which we saw is responsible for lazy black-holing. In other words, thunks built from unsafePerformIO perform eager black-holing.

Since threadPaused has to walk the evaluation stack, unsafeDupablePerformIO might be much faster than unsafePerformIO. In practice, this will matter when performing a great number of very quick IO actions, like peeking a single byte from memory. In this case it is safe to duplicate IO, provided the buffer is unchanging. Let's measure the performance difference.

import GHC.IO
import Foreign hiding (unsafePerformIO)
import System.Random
import Criterion.Main

main = do
let sz = 1024*1024
buf <- mallocBytes sz
let get i = peekByteOff buf i :: IO Word8
peek_d i = unsafeDupablePerformIO (get i)
peek_n i = unsafePerformIO (get i)
idxes = take 1024 $ randomRs (0, sz-1) (mkStdGen 49)
evaluate (sum idxes) -- force idxes ahead of time
defaultMain
[ bench "dup" $ nf (map peek_d) idxes
, bench "noDup" $ nf (map peek_n) idxes ]

And the results:

$ ghc -rtsopts -threaded -O2 peek.hs && ./peek +RTS -N
...

benchmarking dup
mean: 76.42962 us, lb 75.11134 us, ub 78.18593 us, ci 0.950
std dev: 7.764123 us, lb 6.300310 us, ub 9.790345 us, ci 0.950

benchmarking noDup
mean: 142.1720 us, lb 139.7312 us, ub 145.4300 us, ci 0.950
std dev: 14.43673 us, lb 11.40254 us, ub 17.86663 us, ci 0.950

So performance-critical idempotent actions can benefit from unsafeDupablePerformIO. But most code should use the safer unsafePerformIO, as our bug reproducer does. And the noDuplicate# machinery for unsafePerformIO makes sense, so what's causing our bug?

The bug, finally

After all those details and diversions, let's go back to the fix for GHC bug #5558. The action is mostly in rts/sm/Storage.c. This file is part of GHC's storage manager, which provides services such as garbage collection.

Recall that our problematic code looked like this:

ioThunk :: ()
ioThunk = unsafePerformIO $ do ...

This is an application of the function ($) to the argument unsafePerformIO. So it's a static thunk, a CAF. Here's the old description of how CAF evaluation works, from Storage.c:

The entry code for every CAF does the following:

  • builds a BLACKHOLE in the heap
  • pushes an update frame pointing to the BLACKHOLE
  • calls newCaf, below
  • updates the CAF with a static indirection to the BLACKHOLE

Why do we build an BLACKHOLE in the heap rather than just updating the thunk directly? It's so that we only need one kind of update frame - otherwise we'd need a static version of the update frame too.

So here's the problem. Normal thunks get blackholed in place, and a thread detects duplicated evaluation by noticing that one of its thunks-in-progress became a BLACKHOLE. But static thunks — CAFs — are blackholed by indirection. Two threads might perform the above procedure concurrently, producing two different heap-allocated BLACKHOLEs, and they'd never notice.

As Simon Marlow put it:

Note [atomic CAF entry]

With THREADED_RTS, newCaf() is required to be atomic (see #5558). This is because if two threads happened to enter the same CAF simultaneously, they would create two distinct CAF_BLACKHOLEs, and so the normal threadPaused() machinery for detecting duplicate evaluation will not detect this. Hence in lockCAF() below, we atomically lock the CAF with WHITEHOLE before updating it with IND_STATIC, and return zero if another thread locked the CAF first. In the event that we lost the race, CAF entry code will re-enter the CAF and block on the other thread's CAF_BLACKHOLE.

I can't explain precisely what a WHITEHOLE means, but they're used for spin locks or wait-free synchronization in various places. For example, the MVar primitives are synchronized by the lockClosure spinlock routine, which uses WHITEHOLEs.

The fix

Here's the corrected CAF evaluation procedure:

The entry code for every CAF does the following:

  • builds a CAF_BLACKHOLE in the heap
  • calls newCaf, which atomically updates the CAF with IND_STATIC pointing to the CAF_BLACKHOLE
  • if newCaf returns zero, it re-enters the CAF (see Note [atomic CAF entry])
  • pushes an update frame pointing to the CAF_BLACKHOLE

newCAF is made atomic by introducing a new helper function, lockCAF, which is reproduced here for your viewing pleasure:

STATIC_INLINE StgWord lockCAF (StgClosure *caf, StgClosure *bh)
{
const StgInfoTable *orig_info;

orig_info = caf->header.info;

#ifdef THREADED_RTS
const StgInfoTable *cur_info;

if (orig_info == &stg_IND_STATIC_info ||
orig_info == &stg_WHITEHOLE_info) {
// already claimed by another thread; re-enter the CAF
return 0;
}

cur_info = (const StgInfoTable *)
cas((StgVolatilePtr)&caf->header.info,
(StgWord)orig_info,
(StgWord)&stg_WHITEHOLE_info);

if (cur_info != orig_info) {
// already claimed by another thread; re-enter the CAF
return 0;
}

// successfully claimed by us; overwrite with IND_STATIC
#endif

// For the benefit of revertCAFs(), save the original info pointer
((StgIndStatic *)caf)->saved_info = orig_info;

((StgIndStatic*)caf)->indirectee = bh;
write_barrier();
SET_INFO(caf,&stg_IND_STATIC_info);

return 1;
}

We grab the CAF's info table pointer, which tells us what kind of object it is. If it's not already claimed by another thread, we write a WHITEHOLE — but only if the CAF hasn't changed in the meantime. This step is an atomic compare-and-swap, implemented by architecture-specific code. The function cas is specified by this pseudocode:

cas(p,o,n) {
atomically {
r = *p;
if (r == o) { *p = n };
return r;
}
}

Here's the implementation for x86, using GCC extended inline assembly:

EXTERN_INLINE StgWord
cas(StgVolatilePtr p, StgWord o, StgWord n)
{
__asm__ __volatile__ (
"lock\ncmpxchg %3,%1"
:"=a"(o), "=m" (*(volatile unsigned int *)p)
:"0" (o), "r" (n));
return o;
}

There are some interesting variations between architectures. SPARC and x86 use single instructions, while PowerPC and ARMv6 have longer sequences. Old ARM processors require a global spinlock, which sounds painful. Who's running Haskell on ARMv5 chips?

*deep breath*

Thanks for reading / skimming this far! I learned a lot by writing this article, and I hope you enjoyed reading it. I'm sure I said something wrong somewhere, so please do not hesitate to correct me in the comments.


  1. Tim Harris, Simon Marlow, and Simon Peyton Jones. Haskell on a shared-memory multiprocessor. In Haskell '05: Proceedings of the 2005 ACM SIGPLAN workshop on Haskell, pages 49–61.

  2. Simon Marlow, Simon Peyton Jones, and Satnam Singh. Runtime Support for Multicore Haskell. In ICFP'09.

8 comments:

  1. > Who's running Haskell on ARMv5 chips?

    Debian does:
    $ uname -a
    Linux abel 2.6.32 #4 Thu Sep 29 18:52:43 UTC 2011 armv5tel GNU/Linux

    But it is causing problems: http://lists.debian.org/debian-haskell/2011/10/msg00065.html

    ReplyDelete
  2. I don't know GHC internals, and found this enlightening and interesting. Thanks :)

    ReplyDelete
  3. (Split into a new comment at the blog software's insistence.)

    I ended up with the following hierarchy of what things are safe to perform unsafely in what ways. I would appreciate any experienced GHC hackers telling me if I'm wrong about it.

    unsafePerformIO x: x must be externally observably pure: you must not be able to tell when or how many times x was evaluated from within a pure function.

    unsafeDupablePerformIO x: x must be externally pure, period: you must not be able to tell when or how many times x was evaluated from anywhere. Or, at least, it must not make a difference. I'm not sure if there's a one-to-one correspondence, but another way of putting it might be that the result of x must not have identity.

    unsafeInlinePerformIO x: x must be internally as well as externally pure; it must not be the composition of impure parts into a pure whole, because the GHC optimizer will take it, decompose it, and scatter its component parts over a wide area.

    ReplyDelete
  4. ...and now it's gone and lost the first half. Should I post it again? I guess I should make a blog post out of this, but I don't have a blog... maybe some day.

    ReplyDelete
  5. @illissius
    I have a more operational take on it (I am not an experienced GHC hacker):

    unsafeDupablePerformIO: the input must be idempotent, because we may run it twice.

    unsafeInlinePerformIO: the operation gets entirely inlined, so optimization passes may decide to float various bits and pieces out from under the call, including constant-sized allocations and other bad things.

    ReplyDelete
  6. @Antoine

    The bit about idempotency was exactly what the first half of my comment was about, and why I think it's wrong. For some reason it disappeared when I posted the second half. Anyway, here it is again...


    I really liked this post. Thanks.

    I was thinking about, asking about, and discussing unsafeDupablePerformIO on IRC a month or many months ago. My problem was that I didn't quite understand why it needed to exist -- or rather, why unsafeNotDupablePerformIO needed to exist. My initial thinking was that unsafeDupablePerformIO is safe for idempotent IO actions, while unsafePerformIO is safe even for non-idempotent ones. But as I thought about it, I realized this couldn't be right. 'Write $x into address $y in memory' is idempotent: doing it multiple times won't change the result. But does unsafeDupablePerformIO at least guarantee that any duplicate performances will all happen before the evaluation is considered 'done', and before the next thing happens? Otherwise, $x will be written into $y at some random point in time, which might be bad. It became obvious that it couldn't be so: some IO actions return results, and obviously it can't be returning more than one of them, so presumably it's going to use the first one it gets. And the other copies are going to silently execute in the background somewhere at some unknown time and have their results discarded. (IRC corrected me that this wasn't quite right, actually: what does happen is that more than one thread might request the result of a computation at the same time, and end up computing it independently along with any side effects, and each will use the result of its own computation.) But either way, idempotency clearly wasn't the right precondition for being safe to unsafeDupablePerformIO.

    I realized that my hypothetical idempotent IO action wasn't safe to unsafeNotDupablePerformIO, either. Why would you unsafelyPerformIO something without a result, but with a side effect, after all? That's very wrong by definition. You generally want the opposite: a useful result without side effects. If you have side effects, referential transparency breaks, and unsafePerformIO is unsafe. The reason why unsafePerformIO exists is because side effecting computations can be composed into pure ones.

    But if the IO action passed to unsafePerformIO must, by definition, be externally pure, in order to be correct, then why is evaluating it multiple times a problem? In other words, once you allow for a computation to be performed at an unpredictable time due to laziness, what further constraints does it provoke if you must also allow for it to be performed multiple times? What possible input could there be for which unsafePerformIO is safe and unsafeDupablePerformIO is not? After a productive discussion on IRC it became clear that the important word is 'observable'. All agreed that if you were to implement, say, immutable reference cells using IORefs in the background (though this would have little practical value) that using unsafeDupablePerformIO to implement the read, copy-and-write, etc. operations would have no observably ill effects... from within a pure function. The big difference-making thing turns out to be stuff that's not observable from a pure function, but *is* observable from the IO monad. You can't tell from a pure function whether evaluations of copyAndWrite from two separate threads got the same object as result or two equivalent objects, but if you start mucking with the bare IORefs in the IO monad and expect it to be the former, you might get burned. (As a more blatant example, you could have a hidden reference field which gets secretly updated by externally-pure functions but which is only accessible from IO.)

    ReplyDelete
  7. There is a slight factual inaccuracy in this article: it is stated that lazy black holing only occurs when threads enter the garbage collector; however, in modern GHC, lazy blackholing occurs whenever a thread returns to the scheduler (e.g. including if it was preempted). This allows us to avoid having to walk all the thread stacks to black hole at once when a GC occurs.

    ReplyDelete