This article is about a GHC bug I encountered recently, but it's really an excuse to talk about some GHC internals at an intro level. (In turn, an excuse for me to learn about those internals.)
I'll assume you're familiar with the basics of Haskell and lazy evaluation.
The bug
I spoke before of using global locks in Haskell to protect a thread-unsafe C library. Unfortunately, a GHC bug prevents this from working: using unsafePerformIO at the top level of a file can result in IO that happens more than once.
Here is a simple program which illustrates the problem:
import Control.Concurrent
import Control.Monad
import System.IO.Unsafe

ioThunk :: ()
ioThunk = unsafePerformIO $ do
    me <- myThreadId
    putStrLn ("IO executed by " ++ show me)
{-# NOINLINE ioThunk #-}

main :: IO ()
main = do
    replicateM_ 100 (forkIO (print ioThunk))
    threadDelay 10000 -- wait for other threads
Let's test this, following the compiler flag recommendations for unsafePerformIO.
$ ghc -V
The Glorious Glasgow Haskell Compilation System, version 7.2.1
$ ghc -rtsopts -threaded -fno-cse -fno-full-laziness dupe.hs
[1 of 1] Compiling Main ( dupe.hs, dupe.o )
Linking dupe ...
$ while true; do ./dupe +RTS -N | head -n 2; echo ----; done
Within a few seconds I see output like this:
----
IO executed by ThreadId 35
()
----
IO executed by ThreadId 78
IO executed by ThreadId 85
----
IO executed by ThreadId 48
()
----
In the middle run, two threads executed the IO action.
This bug was reported two weeks ago and is already fixed in GHC HEAD. I tested with GHC 7.3.20111026, aka g6f5b798, and the problem seemed to go away.
Unfortunately, it will be some time before GHC 7.4 is widely deployed, so I'm thinking about workarounds for my original global locking problem. I'll probably store the lock in a C global variable via StablePtr, or failing that, implement all locking in C. But I'd appreciate any other suggestions.
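For concreteness, here's a sketch of the fall-back plan, with all locking kept on the C side so that no Haskell-level CAF is involved. This is my own illustration, not code from the bug report; the hs_clib_lock/hs_clib_unlock names and the C file are hypothetical.

{-# LANGUAGE ForeignFunctionInterface #-}

import Control.Exception (bracket_)

-- The C side (a hypothetical clib_lock.c) would hold the mutex in a
-- global, initialized statically:
--
--   static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
--   void hs_clib_lock(void)   { pthread_mutex_lock(&lock);   }
--   void hs_clib_unlock(void) { pthread_mutex_unlock(&lock); }

foreign import ccall safe "hs_clib_lock"   cLock   :: IO ()
foreign import ccall safe "hs_clib_unlock" cUnlock :: IO ()

-- Run an action while holding the C-side lock. "safe" calls let other
-- Haskell threads keep running while this one blocks on the mutex.
withCLib :: IO a -> IO a
withCLib = bracket_ cLock cUnlock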
The remainder of this article is an attempt to explain this GHC bug, and the fix committed by Simon Marlow. It's long because I try not to assume you know anything about how GHC works (I don't know very much, myself), and because there are various digressions.
Objects at runtime
Code produced by GHC can allocate many kinds of objects. Here are just a few:
CONSTR objects represent algebraic data constructors and their associated fields. The value (Just 'x') would be represented by a CONSTR object, holding a pointer to another object representing 'x'.
FUN objects represent functions, like the value (\x -> x+1).
THUNK objects represent computations which have not yet happened. Suppose we write:
let x = 2 + 2 in f x x
This code will construct a THUNK object for x and pass it to the code for f. Some time later, f may force evaluation of its argument, and the thunk will, in turn, invoke (+). When the thunk has finished evaluating, it is overwritten with the evaluation result. (Here, this might be an I# CONSTR holding the number 4.) If f then forces its second argument, which is also x, the work done by (+) is not repeated. This is the essence of lazy evaluation.
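You can watch this sharing happen with a little Debug.Trace instrumentation (a snippet of my own, not from the article): the message fires only the first time x is forced.

import Debug.Trace (trace)

main :: IO ()
main = do
    let x = trace "forcing x" (2 + 2 :: Int)
    -- x is forced twice below, but "forcing x" prints only once:
    -- after the first force, the thunk holds the result 4.
    print (x + x)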
When a thunk is forced, it's first overwritten with a BLACKHOLE object. This BLACKHOLE is eventually replaced with the evaluation result. Therefore a BLACKHOLE represents a thunk which is currently being evaluated. Identifying this case helps the garbage collector, and it also gives GHC its seemingly magical ability to detect some infinite loops. Forcing a BLACKHOLE indicates a computation which cannot proceed until the same computation has finished. The GHC runtime will terminate the program with a <<loop>> exception.
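Here's a tiny example of my own that trips this detection. Compiled with GHC, it typically dies with <<loop>> rather than spinning forever:

main :: IO ()
main = do
    let x = x + 1 :: Int  -- this thunk can only proceed by forcing itself
    print x               -- the runtime reports: <<loop>>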
We can't truly update thunks in place, because the evaluation result might be larger than the space originally allocated for the thunk. So we write an indirection pointing to the evaluation result. These IND objects will later be removed by the garbage collector.
Static objects
Dynamically-allocated objects make sense for values which are created as your program runs. But the top-level declarations in a Haskell module don't need to be dynamically allocated; they already exist when your program starts up. GHC allocates these static objects in your executable's data section, the same place where C global variables live.
Consider this program:
x = Just 'x'
f (Just _) = \y -> y+1
main = print (f x 3)
Ignoring optimizations, GHC will produce code where:
- x is a CONSTR_STATIC object.
- f is a FUN_STATIC object. When called, f will return a dynamically-allocated FUN object representing (\y -> y+1).
- main is a THUNK_STATIC object. It represents the unevaluated expression formed by applying the function print to the argument (f x 3). A static thunk is also known as a constant applicative form, or a CAF for short. Like any other thunk, a CAF may or may not get evaluated. If evaluated, it will be replaced with a black hole and eventually the evaluation result. In this example, main will be evaluated by the runtime system, in deciding what IO to perform. (A short demonstration of CAF behavior follows this list.)
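Here's a quick demonstration of my own that a CAF really is evaluated at most once:

import Debug.Trace (trace)

-- A top-level, argument-free thunk: a CAF.
caf :: Integer
caf = trace "evaluating caf" (sum [1..10000])

main :: IO ()
main = do
    print caf  -- "evaluating caf" prints before the result...
    print caf  -- ...but not again: the CAF now holds the value 50005000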
Black holes and revelations
That's all fine for a single-threaded Haskell runtime, but GHC supports running many Haskell threads across multiple OS threads. This introduces some additional complications. For example, one thread might force a thunk which is currently being evaluated by another thread. The thread will find a BLACKHOLE, but terminating the program would be incorrect. Instead the BLACKHOLE puts the current Haskell thread to sleep, and wakes it up when the evaluation result is ready.
If two threads force the same thunk at the same time, they will both perform the deferred computation. We could avoid this wasted effort by writing and checking for black holes using expensive atomic memory operations. But this is a poor tradeoff; we slow down every evaluation in order to prevent a rare race condition.
As a compiler for a language with pure evaluation, GHC has the luxury of tolerating some duplicated computation. Evaluating an expression twice can't change a program's behavior. And most thunks are cheap to evaluate, hardly worth the effort of avoiding duplication. So GHC follows a "lazy black-holing" strategy. Threads write black holes only when they enter the garbage collector. If a thread discovers that one of its thunks has already been claimed, it will abandon the duplicated work-in-progress. This scheme avoids large wasted computations without paying the price on small computations. You can find the gritty details within the function threadPaused, in rts/ThreadPaused.c.
You may remember that we started, all those many words ago, with a program that uses unsafePerformIO. This breaks the pure-evaluation property of Haskell. Repeated evaluation will affect semantics! Might lazy black-holing be the culprit in the original bug?
Naturally, the GHC developers thought about this case. Here's the implementation of unsafePerformIO:
unsafePerformIO m = unsafeDupablePerformIO (noDuplicate >> m)
noDuplicate = IO $ \s -> case noDuplicate# s of s' -> (# s', () #)
unsafeDupablePerformIO (IO m) = lazy (case m realWorld# of (# _, r #) -> r)
The core behavior is implemented by unsafeDupablePerformIO, using GHC's internal representation of IO actions (which is beyond the scope of this article, to the extent I even have a scope in mind). As the name suggests, unsafeDupablePerformIO provides no guarantee against duplicate execution. The more familiar unsafePerformIO builds this guarantee by first invoking the noDuplicate# primitive operation.
The implementation of noDuplicate#, written in GHC's Cmm intermediate language, handles a few tricky considerations. But it's basically a call to the function threadPaused, which we saw is responsible for lazy black-holing. In other words, thunks built from unsafePerformIO perform eager black-holing.
Since threadPaused has to walk the evaluation stack, unsafeDupablePerformIO might be much faster than unsafePerformIO. In practice, this will matter when performing a great number of very quick IO actions, like peeking a single byte from memory. In this case it is safe to duplicate IO, provided the buffer is unchanging. Let's measure the performance difference.
import GHC.IO
import Foreign hiding (unsafePerformIO)
import System.Random
import Criterion.Main

main = do
    let sz = 1024*1024
    buf <- mallocBytes sz

    let get    i = peekByteOff buf i :: IO Word8
        peek_d i = unsafeDupablePerformIO (get i)
        peek_n i = unsafePerformIO (get i)
        idxes  = take 1024 $ randomRs (0, sz-1) (mkStdGen 49)

    evaluate (sum idxes) -- force idxes ahead of time

    defaultMain
        [ bench "dup"   $ nf (map peek_d) idxes
        , bench "noDup" $ nf (map peek_n) idxes ]
And the results:
$ ghc -rtsopts -threaded -O2 peek.hs && ./peek +RTS -N
...
benchmarking dup
mean: 76.42962 us, lb 75.11134 us, ub 78.18593 us, ci 0.950
std dev: 7.764123 us, lb 6.300310 us, ub 9.790345 us, ci 0.950
benchmarking noDup
mean: 142.1720 us, lb 139.7312 us, ub 145.4300 us, ci 0.950
std dev: 14.43673 us, lb 11.40254 us, ub 17.86663 us, ci 0.950
So performance-critical idempotent actions can benefit from unsafeDupablePerformIO. But most code should use the safer unsafePerformIO, as our bug reproducer does. And the noDuplicate# machinery for unsafePerformIO makes sense, so what's causing our bug?
The bug, finally
After all those details and diversions, let's go back to the fix for GHC bug #5558. The action is mostly in rts/sm/Storage.c. This file is part of GHC's storage manager, which provides services such as garbage collection.
Recall that our problematic code looked like this:
ioThunk :: ()
ioThunk = unsafePerformIO $ do ...
This is an application of the function ($) to two arguments, the first being unsafePerformIO. So it's a static thunk, a CAF. Here's the old description of how CAF evaluation works, from Storage.c:
The entry code for every CAF does the following:

- builds a BLACKHOLE in the heap
- pushes an update frame pointing to the BLACKHOLE
- calls newCaf, below
- updates the CAF with a static indirection to the BLACKHOLE
Why do we build a BLACKHOLE in the heap rather than just updating the thunk directly? It's so that we only need one kind of update frame - otherwise we'd need a static version of the update frame too.
So here's the problem. Normal thunks get blackholed in place, and a thread detects duplicated evaluation by noticing that one of its thunks-in-progress became a BLACKHOLE. But static thunks — CAFs — are blackholed by indirection. Two threads might perform the above procedure concurrently, producing two different heap-allocated BLACKHOLEs, and they'd never notice.
As Simon Marlow put it:
Note [atomic CAF entry]

With THREADED_RTS, newCaf() is required to be atomic (see #5558). This is because if two threads happened to enter the same CAF simultaneously, they would create two distinct CAF_BLACKHOLEs, and so the normal threadPaused() machinery for detecting duplicate evaluation will not detect this. Hence in lockCAF() below, we atomically lock the CAF with WHITEHOLE before updating it with IND_STATIC, and return zero if another thread locked the CAF first. In the event that we lost the race, CAF entry code will re-enter the CAF and block on the other thread's CAF_BLACKHOLE.
I can't explain precisely what a WHITEHOLE means, but they're used for spin locks or wait-free synchronization in various places. For example, the MVar primitives are synchronized by the lockClosure spinlock routine, which uses WHITEHOLEs.
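To make the spin-lock pattern concrete at the source level, here's a toy lock of my own in plain Haskell. It's only an analogy for what the RTS does with WHITEHOLEs, not the actual mechanism; atomicModifyIORef plays the role of the atomic instruction.

import Control.Concurrent (yield)
import Data.IORef

newtype SpinLock = SpinLock (IORef Bool)  -- True = locked

newSpinLock :: IO SpinLock
newSpinLock = fmap SpinLock (newIORef False)

acquire :: SpinLock -> IO ()
acquire s@(SpinLock r) = do
    -- One atomic read-modify-write: set the flag, and learn whether
    -- we were the thread that flipped it from False to True.
    won <- atomicModifyIORef r (\held -> (True, not held))
    if won then return () else yield >> acquire s

release :: SpinLock -> IO ()
release (SpinLock r) = atomicModifyIORef r (\_ -> (False, ()))

As in the RTS, the loser of the race doesn't block in the OS; it simply retries (here after a yield).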
The fix
Here's the corrected CAF evaluation procedure:
The entry code for every CAF does the following:

- builds a CAF_BLACKHOLE in the heap
- calls newCaf, which atomically updates the CAF with IND_STATIC pointing to the CAF_BLACKHOLE
- if newCaf returns zero, it re-enters the CAF (see Note [atomic CAF entry])
- pushes an update frame pointing to the CAF_BLACKHOLE
newCAF is made atomic by introducing a new helper function, lockCAF, which is reproduced here for your viewing pleasure:
STATIC_INLINE StgWord lockCAF (StgClosure *caf, StgClosure *bh)
{
    const StgInfoTable *orig_info;

    orig_info = caf->header.info;

#ifdef THREADED_RTS
    const StgInfoTable *cur_info;

    if (orig_info == &stg_IND_STATIC_info ||
        orig_info == &stg_WHITEHOLE_info) {
        // already claimed by another thread; re-enter the CAF
        return 0;
    }

    cur_info = (const StgInfoTable *)
        cas((StgVolatilePtr)&caf->header.info,
            (StgWord)orig_info,
            (StgWord)&stg_WHITEHOLE_info);

    if (cur_info != orig_info) {
        // already claimed by another thread; re-enter the CAF
        return 0;
    }

    // successfully claimed by us; overwrite with IND_STATIC
#endif

    // For the benefit of revertCAFs(), save the original info pointer
    ((StgIndStatic *)caf)->saved_info = orig_info;

    ((StgIndStatic*)caf)->indirectee = bh;
    write_barrier();
    SET_INFO(caf, &stg_IND_STATIC_info);

    return 1;
}
We grab the CAF's info table pointer, which tells us what kind of object it is. If it's not already claimed by another thread, we write a WHITEHOLE — but only if the CAF hasn't changed in the meantime. This step is an atomic compare-and-swap, implemented by architecture-specific code. The function cas is specified by this pseudocode:
cas(p,o,n) {
    atomically {
        r = *p;
        if (r == o) { *p = n };
        return r;
    }
}
Here's the implementation for x86, using GCC extended inline assembly:
EXTERN_INLINE StgWord
cas(StgVolatilePtr p, StgWord o, StgWord n)
{
__asm__ __volatile__ (
"lock\ncmpxchg %3,%1"
:"=a"(o), "=m" (*(volatile unsigned int *)p)
:"0" (o), "r" (n));
return o;
}
There are some interesting variations between architectures. SPARC and x86 use single instructions, while PowerPC and ARMv6 have longer sequences. Old ARM processors require a global spinlock, which sounds painful. Who's running Haskell on ARMv5 chips?
*deep breath*
Thanks for reading / skimming this far! I learned a lot by writing this article, and I hope you enjoyed reading it. I'm sure I said something wrong somewhere, so please do not hesitate to correct me in the comments.