Monday, October 31, 2011

Thunks and lazy blackholes: an introduction to GHC at runtime

This article is about a GHC bug I encountered recently, but it's really an excuse to talk about some GHC internals at an intro level. (In turn, an excuse for me to learn about those internals.)

I'll assume you're familiar with the basics of Haskell and lazy evaluation.

The bug

I spoke before of using global locks in Haskell to protect a thread-unsafe C library. Unfortunately a GHC bug prevents this from working. Using unsafePerformIO at the top level of a file can result in IO that happens more than once.

Here is a simple program which illustrates the problem:

import Control.Concurrent
import Control.Monad
import System.IO.Unsafe

ioThunk :: ()
ioThunk = unsafePerformIO $ do
me <- myThreadId
putStrLn ("IO executed by " ++ show me)
{-# NOINLINE ioThunk #-}

main :: IO ()
main = do
replicateM_ 100 (forkIO (print ioThunk))
threadDelay 10000 -- wait for other threads

Let's test this, following the compiler flag recommendations for unsafePerformIO.

$ ghc -V
The Glorious Glasgow Haskell Compilation System, version 7.2.1

$ ghc -rtsopts -threaded -fno-cse -fno-full-laziness dupe.hs
[1 of 1] Compiling Main             ( dupe.hs, dupe.o )
Linking dupe ...

$ while true; do ./dupe +RTS -N | head -n 2; echo ----; done

Within a few seconds I see output like this:

----
IO executed by ThreadId 35
()
----
IO executed by ThreadId 78
IO executed by ThreadId 85
----
IO executed by ThreadId 48
()
----

In the middle run, two threads executed the IO action.

This bug was reported two weeks ago and is already fixed in GHC HEAD. I tested with GHC 7.3.20111026, aka g6f5b798, and the problem seemed to go away.

Unfortunately it will be some time before GHC 7.4 is widely deployed, so I'm thinking about workarounds for my original global locking problem. I'll probably store the lock in a C global variable via StablePtr, or failing that, implement all locking in C. But I'd appreciate any other suggestions.

The remainder of this article is an attempt to explain this GHC bug, and the fix committed by Simon Marlow. It's long because

  • I try not to assume you know anything about how GHC works. I don't know very much, myself.

  • There are various digressions.

Objects at runtime

Code produced by GHC can allocate many kinds of objects. Here are just a few:

  • CONSTR objects represent algebraic data constructors and their associated fields. The value (Just 'x') would be represented by a CONSTR object, holding a pointer to another object representing 'x'.

  • FUN objects represent functions, like the value (\x -> x+1).

  • THUNK objects represent computations which have not yet happened. Suppose we write:

    let x = 2 + 2 in f x x

    This code will construct a THUNK object for x and pass it to the code for f. Some time later, f may force evaluation of its argument, and the thunk will, in turn, invoke (+). When the thunk has finished evaluating, it is overwritten with the evaluation result. (Here, this might be an I# CONSTR holding the number 4.) If f then forces its second argument, which is also x, the work done by (+) is not repeated. This is the essence of lazy evaluation.

  • When a thunk is forced, it's first overwritten with a BLACKHOLE object. This BLACKHOLE is eventually replaced with the evaluation result. Therefore a BLACKHOLE represents a thunk which is currently being evaluated.

    Identifying this case helps the garbage collector, and it also gives GHC its seemingly magical ability to detect some infinite loops. Forcing a BLACKHOLE indicates a computation which cannot proceed until the same computation has finished. The GHC runtime will terminate the program with a <<loop>> exception.

  • We can't truly update thunks in place, because the evaluation result might be larger than the space originally allocated for the thunk. So we write an indirection pointing to the evaluation result. These IND objects will later be removed by the garbage collector.

Static objects

Dynamically-allocated objects make sense for values which are created as your program runs. But the top-level declarations in a Haskell module don't need to be dynamically allocated; they already exist when your program starts up. GHC allocates these static objects in your executable's data section, the same place where C global variables live.

Consider this program:

x = Just 'x'

f (Just _) = \y -> y+1

main = print (f x 3)

Ignoring optimizations, GHC will produce code where:

  • x is a CONSTR_STATIC object.

  • f is a FUN_STATIC object. When called, f will return a dynamically-allocated FUN object representing (\y -> y+1).

  • main is a THUNK_STATIC object. It represents the unevaluated expression formed by applying the function print to the argument (f x 3). A static thunk is also known as a constant applicative form, or a CAF for short. Like any other thunk, a CAF may or may not get evaluated. If evaluated, it will be replaced with a black hole and eventually the evaluation result. In this example, main will be evaluated by the runtime system, in deciding what IO to perform.

Black holes and revelations

That's all fine for a single-threaded Haskell runtime, but GHC supports running many Haskell threads across multiple OS threads. This introduces some additional complications. For example, one thread might force a thunk which is currently being evaluated by another thread. The thread will find a BLACKHOLE, but terminating the program would be incorrect. Instead the BLACKHOLE puts the current Haskell thread to sleep, and wakes it up when the evaluation result is ready.

If two threads force the same thunk at the same time, they will both perform the deferred computation. We could avoid this wasted effort by writing and checking for black holes using expensive atomic memory operations. But this is a poor tradeoff; we slow down every evaluation in order to prevent a rare race condition.

As a compiler for a language with pure evaluation, GHC has the luxury of tolerating some duplicated computation. Evaluating an expression twice can't change a program's behavior. And most thunks are cheap to evaluate, hardly worth the effort of avoiding duplication. So GHC follows a "lazy black-holing" strategy.12 Threads write black holes only when they enter the garbage collector. If a thread discovers that one of its thunks has already been claimed, it will abandon the duplicated work-in-progress. This scheme avoids large wasted computations without paying the price on small computations. You can find the gritty details within the function threadPaused, in rts/ThreadPaused.c.

unsafe[Dupable]PerformIO

You may remember that we started, all those many words ago, with a program that uses unsafePerformIO. This breaks the pure-evaluation property of Haskell. Repeated evaluation will affect semantics! Might lazy black-holing be the culprit in the original bug?

Naturally, the GHC developers thought about this case. Here's the implementation of unsafePerformIO:

unsafePerformIO m = unsafeDupablePerformIO (noDuplicate >> m)

noDuplicate = IO $ \s -> case noDuplicate# s of s' -> (# s', () #)

unsafeDupablePerformIO (IO m) = lazy (case m realWorld# of (# _, r #) -> r)

The core behavior is implemented by unsafeDupablePerformIO, using GHC's internal representation of IO actions (which is beyond the scope of this article, to the extent I even have a scope in mind). As the name suggests, unsafeDupablePerformIO provides no guarantee against duplicate execution. The more familiar unsafePerformIO builds this guarantee by first invoking the noDuplicate# primitive operation.

The implementation of noDuplicate#, written in GHC's Cmm intermediate language, handles a few tricky considerations. But it's basically a call to the function threadPaused, which we saw is responsible for lazy black-holing. In other words, thunks built from unsafePerformIO perform eager black-holing.

Since threadPaused has to walk the evaluation stack, unsafeDupablePerformIO might be much faster than unsafePerformIO. In practice, this will matter when performing a great number of very quick IO actions, like peeking a single byte from memory. In this case it is safe to duplicate IO, provided the buffer is unchanging. Let's measure the performance difference.

import GHC.IO
import Foreign hiding (unsafePerformIO)
import System.Random
import Criterion.Main

main = do
let sz = 1024*1024
buf <- mallocBytes sz
let get i = peekByteOff buf i :: IO Word8
peek_d i = unsafeDupablePerformIO (get i)
peek_n i = unsafePerformIO (get i)
idxes = take 1024 $ randomRs (0, sz-1) (mkStdGen 49)
evaluate (sum idxes) -- force idxes ahead of time
defaultMain
[ bench "dup" $ nf (map peek_d) idxes
, bench "noDup" $ nf (map peek_n) idxes ]

And the results:

$ ghc -rtsopts -threaded -O2 peek.hs && ./peek +RTS -N
...

benchmarking dup
mean: 76.42962 us, lb 75.11134 us, ub 78.18593 us, ci 0.950
std dev: 7.764123 us, lb 6.300310 us, ub 9.790345 us, ci 0.950

benchmarking noDup
mean: 142.1720 us, lb 139.7312 us, ub 145.4300 us, ci 0.950
std dev: 14.43673 us, lb 11.40254 us, ub 17.86663 us, ci 0.950

So performance-critical idempotent actions can benefit from unsafeDupablePerformIO. But most code should use the safer unsafePerformIO, as our bug reproducer does. And the noDuplicate# machinery for unsafePerformIO makes sense, so what's causing our bug?

The bug, finally

After all those details and diversions, let's go back to the fix for GHC bug #5558. The action is mostly in rts/sm/Storage.c. This file is part of GHC's storage manager, which provides services such as garbage collection.

Recall that our problematic code looked like this:

ioThunk :: ()
ioThunk = unsafePerformIO $ do ...

This is an application of the function ($) to the argument unsafePerformIO. So it's a static thunk, a CAF. Here's the old description of how CAF evaluation works, from Storage.c:

The entry code for every CAF does the following:

  • builds a BLACKHOLE in the heap
  • pushes an update frame pointing to the BLACKHOLE
  • calls newCaf, below
  • updates the CAF with a static indirection to the BLACKHOLE

Why do we build an BLACKHOLE in the heap rather than just updating the thunk directly? It's so that we only need one kind of update frame - otherwise we'd need a static version of the update frame too.

So here's the problem. Normal thunks get blackholed in place, and a thread detects duplicated evaluation by noticing that one of its thunks-in-progress became a BLACKHOLE. But static thunks — CAFs — are blackholed by indirection. Two threads might perform the above procedure concurrently, producing two different heap-allocated BLACKHOLEs, and they'd never notice.

As Simon Marlow put it:

Note [atomic CAF entry]

With THREADED_RTS, newCaf() is required to be atomic (see #5558). This is because if two threads happened to enter the same CAF simultaneously, they would create two distinct CAF_BLACKHOLEs, and so the normal threadPaused() machinery for detecting duplicate evaluation will not detect this. Hence in lockCAF() below, we atomically lock the CAF with WHITEHOLE before updating it with IND_STATIC, and return zero if another thread locked the CAF first. In the event that we lost the race, CAF entry code will re-enter the CAF and block on the other thread's CAF_BLACKHOLE.

I can't explain precisely what a WHITEHOLE means, but they're used for spin locks or wait-free synchronization in various places. For example, the MVar primitives are synchronized by the lockClosure spinlock routine, which uses WHITEHOLEs.

The fix

Here's the corrected CAF evaluation procedure:

The entry code for every CAF does the following:

  • builds a CAF_BLACKHOLE in the heap
  • calls newCaf, which atomically updates the CAF with IND_STATIC pointing to the CAF_BLACKHOLE
  • if newCaf returns zero, it re-enters the CAF (see Note [atomic CAF entry])
  • pushes an update frame pointing to the CAF_BLACKHOLE

newCAF is made atomic by introducing a new helper function, lockCAF, which is reproduced here for your viewing pleasure:

STATIC_INLINE StgWord lockCAF (StgClosure *caf, StgClosure *bh)
{
const StgInfoTable *orig_info;

orig_info = caf->header.info;

#ifdef THREADED_RTS
const StgInfoTable *cur_info;

if (orig_info == &stg_IND_STATIC_info ||
orig_info == &stg_WHITEHOLE_info) {
// already claimed by another thread; re-enter the CAF
return 0;
}

cur_info = (const StgInfoTable *)
cas((StgVolatilePtr)&caf->header.info,
(StgWord)orig_info,
(StgWord)&stg_WHITEHOLE_info);

if (cur_info != orig_info) {
// already claimed by another thread; re-enter the CAF
return 0;
}

// successfully claimed by us; overwrite with IND_STATIC
#endif

// For the benefit of revertCAFs(), save the original info pointer
((StgIndStatic *)caf)->saved_info = orig_info;

((StgIndStatic*)caf)->indirectee = bh;
write_barrier();
SET_INFO(caf,&stg_IND_STATIC_info);

return 1;
}

We grab the CAF's info table pointer, which tells us what kind of object it is. If it's not already claimed by another thread, we write a WHITEHOLE — but only if the CAF hasn't changed in the meantime. This step is an atomic compare-and-swap, implemented by architecture-specific code. The function cas is specified by this pseudocode:

cas(p,o,n) {
atomically {
r = *p;
if (r == o) { *p = n };
return r;
}
}

Here's the implementation for x86, using GCC extended inline assembly:

EXTERN_INLINE StgWord
cas(StgVolatilePtr p, StgWord o, StgWord n)
{
__asm__ __volatile__ (
"lock\ncmpxchg %3,%1"
:"=a"(o), "=m" (*(volatile unsigned int *)p)
:"0" (o), "r" (n));
return o;
}

There are some interesting variations between architectures. SPARC and x86 use single instructions, while PowerPC and ARMv6 have longer sequences. Old ARM processors require a global spinlock, which sounds painful. Who's running Haskell on ARMv5 chips?

*deep breath*

Thanks for reading / skimming this far! I learned a lot by writing this article, and I hope you enjoyed reading it. I'm sure I said something wrong somewhere, so please do not hesitate to correct me in the comments.


  1. Tim Harris, Simon Marlow, and Simon Peyton Jones. Haskell on a shared-memory multiprocessor. In Haskell '05: Proceedings of the 2005 ACM SIGPLAN workshop on Haskell, pages 49–61.

  2. Simon Marlow, Simon Peyton Jones, and Satnam Singh. Runtime Support for Multicore Haskell. In ICFP'09.

Monday, October 24, 2011

Quasicrystals as sums of waves in the plane

On the suggestion of a friend, I rendered this animation:

This quasicrystal is full of emergent patterns, but it can be described in a simple way. Imagine that every point in the plane is shaded according to the cosine of its y coordinate. The result would look like this:

Now we can rotate this image to get other waves, like these:

Each frame of the animation is a summation of such waves at evenly-spaced rotations. The animation occurs as each wave moves forward.

I recommend viewing it up close, and then from a few feet back. There are different patterns at each spatial scale.

The code

To render this animation I wrote a Haskell program, using the Repa array library. For my purposes, the advantages of Repa are:

  • Immutable arrays, supporting clean, expressive code
  • A fast implementation, including automatic parallelization
  • Easy output to image files, via repa-devil

Here is a simplified (but complete!) program, which renders a single still image.

import Data.Array.Repa ( Array, DIM2, DIM3, Z(..), (:.)(..) )
import qualified Data.Array.Repa as R
import qualified Data.Array.Repa.IO.DevIL as D

import Data.Word ( Word8 )
import Data.Fixed ( divMod' )

For clarity, we define a few type synonyms:

type R     = Float
type R2 = (R, R)
type Angle = R

We'll convert pixel indices to coordinates in the real plane, with origin at the image center. We have to decide how many pixels to draw, and how much of the plane to show.

pixels :: Int
pixels = 800
scale :: R
scale = 128

Repa's array indices are "snoc lists" of the form (Z :. x :. y). By contrast, our planar coordinates are conventional tuples.

point :: DIM2 -> R2
point = \(Z :. x :. y) -> (adj x, adj y) where
adj n = scale * ((2 * fromIntegral n / denom) - 1)
denom = fromIntegral pixels - 1

A single wave is a cosine depending on x and y coordinates in some proportion, determined by the wave's orientation angle.

wave :: Angle -> R2 -> R
wave th = f where
(cth, sth) = (cos th, sin th)
f (x,y) = (cos (cth*x + sth*y) + 1) / 2

To combine several functions, we sum their outputs, and wrap to produce a result between 0 and 1. As n increases, (wrap n) will rise to 1, fall back to 0, rise again, and so on. sequence converts a list of functions to a function returning a list, using the monad instance for ((->) r).

combine :: [R2 -> R] -> (R2 -> R)
combine xs = wrap . sum . sequence xs where
wrap n = case divMod' n 1 of
(k, v) | odd k -> 1-v
| otherwise -> v

To draw the quasicrystal, we combine waves at 7 angles evenly spaced between 0 and π.

angles :: Int -> [Angle]
angles n = take n $ enumFromThen 0 (pi / fromIntegral n)

quasicrystal :: DIM2 -> R
quasicrystal = combine (map wave (angles 7)) . point

We convert an array of floating-point values to an image in two steps. First, we map floats in [0,1] to bytes in [0,255]. Then we copy this to every color channel. The result is a 3-dimensional array, indexed by (row, column, channel). repa-devil takes such an array and outputs a PNG image file.

toImage :: Array DIM2 R -> Array DIM3 Word8
toImage arr = R.traverse arr8 (:. 4) chans where
arr8 = R.map (floor . (*255) . min 1 . max 0) arr
chans _ (Z :. _ :. _ :. 3) = 255 -- alpha channel
chans a (Z :. x :. y :. _) = a (Z :. x :. y)

main :: IO ()
main = do
let arr = R.fromFunction (Z :. pixels :. pixels) quasicrystal
D.runIL $ D.writeImage "out.png" (toImage arr)

Running it

Repa's array operations automatically run in parallel. We just need to enable GHC's threaded runtime.

$ ghc -O2 -rtsopts -threaded quasicrystal.lhs
$ ./quasicrystal +RTS -N
$ xview out.png

And it looks like this:

Note that repa-devil silently refuses to overwrite an existing file, so you may need to rm out.png first.

On my 6-core machine, this parallel code ran in 3.72 seconds of wall-clock time, at a CPU utilization of 474%. The same code compiled without -threaded took 14.20 seconds, so the net efficiency of parallelization is 382%. This is a good result; what's better is how little work it required on my part. Cutting a mere 10 seconds from a single run is not a big deal. But it starts to matter when rendering many frames of animation, and trying out variations on the algorithm.

As a side note, switching from Float to Double increased the run time by about 30%. I suspect this is due to increased demand for memory bandwidth and cache space.

You can grab the Literate Haskell source and try it out on your own machine. This is my first Repa program ever, so I'd much appreciate feedback on improving the code.

Be sure to check out Michael Rule's work on animating quasicrystals.

Friday, October 21, 2011

Interfacing Haskell to the Concorde solver for Traveling Salesperson Problem

The Traveling Salesperson Problem (TSP) is a famous optimization problem with applications in logistics, manufacturing, and art. In its planar form, we are given a set of "cities", and we want to visit each city while minimizing the total travel distance.

Finding the shortest possible tour is NP-hard, and quickly becomes infeasible as the number of cities grows. But most applications need only a heuristically good solution: a tour which is short, if not the shortest possible. The Lin-Kernighan heuristic quickly produces such tours.

The Concorde project provides a well-regarded collection of TSP solvers. I needed TSP heuristics for a Haskell project, so I wrote a Haskell interface to Concorde's Lin-Kernighan implementation. Concorde provides a C library, but it's far from clear how to use it. Instead I chose to invoke the linkern executable as a subprocess.

The core of the Haskell interface looks like this:

tsp
:: Config -- provides various configurable parameters
-> (a -> R2) -- gives the rectangular coordinates of each point
-> [a] -- list of points to visit
-> IO [a] -- produces points permuted in tour order

tsp lets you represent the points to visit using any type you like. You just provide a function to get the coordinates of each point. The Config parameter controls various aspects of the computation, including the time/quality tradeoff. Defaults are provided, and you can override these selectively using record-update syntax. All considered it's a pretty simple interface which tries to hide the complexity of interacting with an external program.

Visualizing a tour

Here's a example program which computes a tour of 1,000 random points. We'll visualize the tour using the Diagrams library.

import Diagrams.Prelude
( Diagram , Point(P), fillColor , lineWidth
, translate, circle , fromVertices, lightgrey )

import Diagrams.Backend.Cairo.CmdLine
( Cairo, defaultMain )

import Data.Colour.SRGB ( sRGB )
import Data.Colour.RGBSpace ( uncurryRGB )
import Data.Colour.RGBSpace.HSV ( hsv )

import qualified Algorithms.Concorde.LinKern as T

import Control.Monad
import Data.Monoid
import System.Random

tsp takes a list of points and a function to extract the coordinates of a point. Our points are just the coordinates themselves, so we pass the identity function.

type R2 = (Double, Double)

findTour :: [R2] -> IO [R2]
findTour = T.tsp cfg id where
cfg = T.defConfig { T.verbose = True }

The tour is drawn as a loop of line segements. We also shade the interior of this polygon.

diaTour :: [R2] -> Diagram Cairo R2
diaTour xs@(x:_) = sty . fromVertices $ map P (xs ++ [x]) where
sty = fillColor lightgrey . lineWidth 10

Each point visited by the tour is drawn as a circle, with hue indicating its position in the tour.

diaPoints :: [R2] -> Diagram Cairo R2
diaPoints = mconcat . map circ . zip [0..] where
n = fromIntegral numPoints
circ (i,p) = translate p . fillColor color $ circle 40
where color = uncurryRGB sRGB (hsv (360*i/n) 1 1)

Now we put it all together. Note that linkern uses Euclidean distances rounded to the nearest integer. So we need coordinates with fairly large magnitudes. Picking values between 0 and 1 won't work.

numPoints :: Int
numPoints = 1000

main :: IO ()
main = do
let rnd = randomRIO (0,10000)
points <- replicateM numPoints (liftM2 (,) rnd rnd)
tour <- findTour points
defaultMain (diaPoints tour `mappend` diaTour tour)

We run it like so:

$ export PATH=~/concorde-031219/LINKERN:$PATH
$ runhaskell tour.lhs -o out.pdf
$ xpdf out.pdf

The computation takes about 2 seconds on my machine. And the output looks like this:

You can download this post as a Literate Haskell file and run the above program. You'll need to install the concorde and diagrams packages.

The source for the concorde Haskell package includes a more full-featured version of this example.

Tuesday, October 18, 2011

Safe top-level mutable variables for Haskell

I uploaded the safe-globals package for Haskell, which lets you declare top-level mutable variables like so:

{-# LANGUAGE TemplateHaskell #-}
import Data.Global
import Data.IORef

declareIORef "ref"
[t| Int |]
[e| 3 |]

main = do
readIORef ref >>= print -- prints 3
writeIORef ref 5
readIORef ref >>= print -- prints 5

This creates a module-level binding

ref :: IORef Int

which can be managed through the usual module import/export mechanism.

Why global state?

Global state is a sign of bad software design, especially in Haskell. Why would we ever need it? Suppose you're wrapping a C library which is not thread-safe. Using a (hidden!) global lock, you can expose an interface which is simple and safe. In other words, you're using global state to compensate for others using global state.1 Another use case is generating unique identifiers to speed up comparison of values. This can be done without breaking referential transparency, but you need a source of IDs which is really and truly global.

In these situations it's typical to create global variables using a hack such as

ref :: IORef Int
ref = unsafePerformIO (newIORef 3)
{-# NOINLINE ref #-}

My library is just a set of Template Haskell macros for the same hack. If global variables are seldom needed, then what good are these macros?

  • Writing out the hack each time is unsafe. I might forget the NOINLINE pragma, or subvert the type system with a polymorphic reference. The safe-globals library prevents these mistakes. I'm of the opinion — and I know it's not shared by all — that even questionable techniques should be made as safe as possible. Call it "harm reduction" if you like.

  • In ten years, if GHC 9 requires an extra pragma for safety, then safe-globals can be updated, without changing every package that uses it. If JHC's ACIO feature is ported to GHC, then safe-globals can take advantage and get rid of the hacks entirely.

But the direct impetus to write safe-globals was the appearance of the global-variables library, which drew some attention in the Haskell community. global-variables aims to solve the same problem, using a different approach with a number of drawbacks. The rest of this article outlines some of these drawbacks.

Spooky action at a distance

Among the stated features of global-variables are

Avoid having to pass references explicitly throughout the program in order to let distant parts communicate. Enable a communication by convention scheme, where e.g. different libraries may communicate without code dependencies.

This refers to the fact that two global refs with the same name string will become entangled, no matter where in a program they were declared. This is certainly a bug, not a feature. Untracked interactions between different components are the archetypal defect in software engineering.

Neither is there a clear way for a user of global-variables to opt out of this misfeature. The best you can do is augment your names with some prefix which you hope is unique — the same non-solution used by C libraries. Haskell solves namespace problems with a module system and a package system. global-variables circumvents both.

Still, suppose that you choose "communication by convention" for your library. You'll need to manually document the name and type of every ref used by this communication, since they aren't tracked by the type system. A mismatch (as from a library upgrade) will cause silent breakage. Worse, you need to tell every library user how to initialize your library's own variables, and hope that they do it correctly. When a ref is given different initializers in different declarations, the result is indeterminate.

Type clashes

A polymorphic reference, with a type like ∀ t. IORef t, breaks the type system. You can write a value of one type and then read it with another type. So it's important for global-variables to disallow polymorphic refs. The mechanism it uses is that each declaration is implicitly a family of refs, one for each monomorphic type (via Typeable).

Now consider this reasonable-looking program:

{-# LANGUAGE NoMonomorphismRestriction #-}
import Data.Global -- from global-variables
import Data.IORef

-- Define the famous factorial function
fact :: Integer -> Integer
fact 0 = 1
fact n = n * fact (n-1)

-- Now use it through a global ref
ref = declareIORef "ref" 0
main = do
writeIORef ref (length "hello")
r <- readIORef ref
print (fact r)

This will print 1, not 120. The ref is written at type Int (the return type of length) and an implicitly different ref is read at type Integer (because of the subsequent call to fact).

You can certainly argue that top-level refs should always be declared with a monomorphic type signature. Indeed, my library enforces this. But global-variables doesn't, and can't. Making type clashes a run-time error would be a step in the right direction.


  1. A common response is that locking should be added in C code; however, concurrent programming in C is cumbersome and dangerous. It's much easier, if a bit ugly, to implement locking on the Haskell side. You could however store an MVar lock in a C global variable via StablePtr. Has anyone done this?

Saturday, October 15, 2011

phosphene: Fractal video feedback as a PC master boot record

A few months ago, I wrote a Julia set fractal renderer for the demo challenge hosted by io.smashthestack.org.

This program runs in 16-bit x86 real mode, without any operating system. It's formatted as a PC master boot record, which is 512 bytes long. Subtracting out space reserved for a partition table, we have only 446 bytes for code and data.

Programming in such a restricted environment is quite a challenge. It's further complicated by real mode's segmented addressing. Indexing an array bigger than 64 kB requires significant extra code — and that goes double for the video frame buffer. With two off-screen buffers and a 640 × 480 × 1 byte video mode, much of my code is devoted to segment juggling.

I spent a long time playing with code compression. In the end, I couldn't find a scheme which justifies the fixed size cost of its own decoder. It seems that 16-bit x86 machine code is actually pretty information-dense. For a bigger demo or 32-bit mode (with bigger immediate operands) I'd definitely want compression.

It's totally feasible to enter 32-bit protected mode within 446 bytes, but there's little gained by doing so. You lose easy access to the PC BIOS, which is the only thing you have that resembles an operating system or standard library.

You can browse the assembly source code or grab the MBR itself. It runs well in QEMU, with or without KVM, and I also tested it on a few real machines via USB boot. With QEMU on Linux it's as simple as

$ qemu -hda phosphene.mbr

Thanks to Michael Rule for the original idea and for tons of help with tweaking the rendering algorithm. His writeup has more information about this project.

Wednesday, October 12, 2011

Slides from "Why learn Haskell?"

Yesterday I gave a talk on the topic of "Why learn Haskell?", and I've posted the slides [PDF]. Thanks to MIT's SIPB for organizing these talks and providing tasty snacks. Thanks also to the Boston Haskell group for lots of useful feedback towards improving my talk.

If you'd like to adapt my slides for your own venue, the source is available under a CC-By-SA license. I rendered the PDF using pandoc and Beamer.

Monday, October 10, 2011

shqq: Embedding shell commands in Haskell code

Shell scripts make it easy to pass data between external commands. But shell script as a programming language lacks features like non-trivial data structures and easy, robust concurrency. These would be useful in building quick solutions to system administration and automation problems.

As others have noted,12345 Haskell is an interesting alternative for these scripting tasks. I wrote the shqq library to make it a little easier to invoke external programs from Haskell. With the sh quasiquoter, you write a shell command which embeds Haskell variables, execute it as an IO action, and get the command's standard output as a String. In other words, it's a bit like the backtick operator from Perl or Ruby.

Here's a small example:

$ ghci -XQuasiQuotes
λ> import System.ShQQ
λ> let x = "/proc/uptime" in [sh| sha1sum $x |]
"337ec3fb998fb3a4650a18e0785f0992762b3cda  /proc/uptime\n"

shqq also handles escaping for you, so that the shell will not interpret special characters from Haskell variables. You can override this behavior when desired.

The big caveat is that shqq refuses to build on GHC 7.0 or earlier, due to a Unicode-handling bug in process-1.0. You'll need GHC 7.2 or later.

Finding duplicate files

As an example, here's a program to find duplicate files in the directory tree rooted at the current working directory.

{-# LANGUAGE QuasiQuotes, TupleSections #-}
import Control.Concurrent.Spawn -- package spawn-0.3
import Data.List.Split -- package split-0.1
import System.ShQQ -- package shqq-0.1
import System.Posix.Files
import qualified Data.Map as M

First, the computation itself. If we pair each file with a key, such as size or checksum, we can find the groups of (potentially) duplicate files.

dupes :: (Ord k) => [(FilePath,k)] -> [[FilePath]]
dupes = filter (not . null . drop 1) . M.elems
. foldr (\(v,k) -> M.insertWith (++) k [v]) M.empty

We want to examine files in parallel, but the operating system will complain if we have too many open files. We limit each pass to have at most 256 tests in progress at once.

inParallel :: (a -> IO b) -> [a] -> IO [b]
inParallel f xs = do p <- pool 256; parMapIO (p . f) xs

For efficiency, we find potential duplicates by size, and then checksum only these files. We use external shell commands for checksumming as well as the initial directory traversal. At the end we print the names of duplicated files, one per line, with a blank line after each group of duplicates.

main :: IO ()
main = do
files <- endBy "\0" `fmap` [sh| find -type f -print0 |]

let getSize f = ((f,) . fileSize) `fmap` getFileStatus f
sizeDupes <- dupes `fmap` inParallel getSize files

let getSha f = ((f,) . head . words) `fmap` [sh| sha1sum $f |]
shaDupes <- dupes `fmap` inParallel getSha (concat sizeDupes)

mapM_ (mapM_ putStrLn . (++[""])) shaDupes

And we run it like so:

$ ghc -O -threaded -rtsopts dupe.hs
$ ./dupe +RTS -N

I included type signatures for clarity, but you wouldn't need them in a one-off script. Not counting imports and the LANGUAGE pragma, that makes 10 lines of code total. I'm pretty happy with the expressiveness of this solution, especially the use of parallel IO for an easy speedup.

Thanks to Dylan Lukes for the original idea for this library.