Saturday, December 4, 2010

Type-level Fix and generic folds

This article describes a fixpoint combinator for Haskell types, with some justification of why such a thing would be useful. It's meant to be an accessible introduction to material which is already covered elsewhere.

You'll need some basic Haskell knowledge, such as how to define and use polymorphic data types. And you might want to first study the value-level `fix` function. An excellent introduction can be found here.

This code is standard Haskell 98, except in the "GHC tricks" section.

Recursive types

Many useful data types are recursive:

``data Natural = Zero | Succ Natural     -- natural numbersdata List a  = Nil  | Cons a (List a)  -- listsdata Tree a  = Node a [Tree a]         -- rose trees``

In each case, the type declared on the left-hand side of "`=`" is mentioned on the right-hand side. This is fine, but just for fun let's see if there's another way.

We can imagine a language with a keyword "`self`", which refers to the type being defined:

``data Natural = Zero | Succ selfdata List a  = Nil  | Cons a selfdata Tree a  = Node a [self]``

Haskell doesn't have this keyword, so we'll make `self` into a type parameter:

``data NaturalF self = Zero | Succ selfdata ListF a  self = Nil  | Cons a selfdata TreeF a  self = Node a [self]``

We've changed the names because these aren't the types we wanted. They're non-recursive and each has an extra parameter.

To build the recursive type `Natural` using the non-recursive `NaturalF`, we need to "tie the knot":

``type Natural = NaturalF (NaturalF (NaturalF (NaturalF ...)))``

i.e.

``type Natural = NaturalF Natural``

But this is disallowed:

``````foo.hs:3:0:
Cycle in type synonym declarations:
foo.hs:3:0-30: type Natural = NaturalF Natural
``````

Since `type` defines a transparent synonym, this would generate an infinitely-large type-expression. Haskell disallows those. We need to hide the recursion behind a data constructor:

``newtype Natural = MkNat (NaturalF Natural)``

When we use this type, the "wrapper" constructor `MkNat` alternates with the constructors of `NaturalF`:

``toNatural :: Integer -> NaturaltoNatural 0 = MkNat ZerotoNatural n = MkNat . Succ \$ toNatural (n-1)``

Type-level `Fix`

So far this is just pointless complexity. We've replaced one type:

``data Natural = Zero | Succ Natural``

with two types:

``data    NaturalF self = Zero | Succ selfnewtype Natural       = MkNat (NaturalF Natural)``

Notice that the definition of `Natural` doesn't depend on the structure of natural numbers. Let's rename `Natural` to `Fix` and `MkNat` to `In`:

``newtype Fix   = In (NaturalF Fix)``

Then we'll abstract out `NaturalF` as a type parameter `f`:

``newtype Fix f = In (f (Fix f))``

Now `Natural` is just a synonym:

``type Natural = Fix NaturalF``

When we create values of type `Natural`, we still have "wrapper" constructors:

``toNatural :: Integer -> NaturaltoNatural 0 = In ZerotoNatural n = In . Succ \$ toNatural (n-1)``

Properties of `Fix`

Recall that the value-level `fix` function can be defined as

``fix f = f (fix f)``

This looks a lot like our type

``newtype Fix f = In (f (Fix f))``

The similarity is reflected at the type and kind levels:

``````GHCi> :t fix
fix :: (a -> a) -> a
GHCi> :k Fix
Fix :: (* -> *) -> *
``````

`fix` is a function which takes another function. `Fix` is a type constructor which takes another type constructor.

The types `Fix f` and `f (Fix f)` are isomorphic, and we'll want to convert between them. In one direction we'll use the data constructor

``In :: f (Fix f) -> Fix f``

and the other direction is easily defined as

``out :: Fix f -> f (Fix f)out (In x) = x``

Folding lists

So why bother writing types this way? Well, we usually don't. But factoring the recursion out of our data type enables us to factor the recursion out of our functions.

Consider taking the sum of a list:

``sumF :: (Num a) => List a -> asumF (In Nil)         = 0sumF (In (Cons x xs)) = x + sumF xs``

We traverse the structure with the combining function `(+)`, making a recursive call at each recursive position of the data type. How can we capture this pattern?

If we're reducing a value of type `List T` to get a result of type `R`, then a value of type `ListF T R` represents a list node with the recursive call's result already in place. So we might look for a function like

``cata :: (ListF a b -> b) -> List a -> b``

If we had this, we could write

``sumF :: (Num a) => List a -> asumF = cata f where  f Nil = 0  f (Cons x xs) = x + xs``

The function `f` specifies one step of reduction, and the magical function `cata` handles the rest.

The recursion pattern captured by `cata` is very similar to `foldr` on the built-in list type. In fact, we can implement `foldr` with `cata`:

``foldrF :: (a -> b -> b) -> b -> List a -> bfoldrF f z = cata g where  g Nil = z  g (Cons x xs) = f x xs``

If `foldr` is good enough, then what's so special about `cata`? The answer is that `cata` actually has this very general type:

``cata :: (Functor f) => (f a -> a) -> Fix f -> a``

I'll give the one-line definition of `cata` later. Writing it yourself is a fun puzzle.

We can see that we'll need this instance:

``instance Functor (ListF a) where  fmap _ Nil         = Nil  fmap f (Cons x xs) = Cons x (f xs)``

This `fmap` just takes a function and applies it to the `self`-typed field, if any. It's not recursive and has no relation to the `[]` instance of `Functor`, for which `fmap = map`.

Folding natural numbers

Let's put our generic `cata` to good use on another type. Every natural number is either zero, or the successor of another natural number:

``data NaturalF self = Zero | Succ selftype Natural = Fix NaturalF``

Again, `fmap` applies a function to each `self`-typed field:

``instance Functor NaturalF where  fmap _ Zero     = Zero  fmap f (Succ v) = Succ (f v)``

We can then use `cata` to define conversion to `Integer`:

``fromNatural :: Natural -> IntegerfromNatural = cata f where  f Zero     = 0  f (Succ n) = 1 + n``

as well as addition and multiplication:

``add :: Natural -> Natural -> Naturaladd n = cata f where  f Zero     = n  f (Succ m) = In \$ Succ mmul :: Natural -> Natural -> Naturalmul n = cata f where  f Zero     = In Zero  f (Succ m) = add n m``

Testing it:

``````GHCi> fromNatural (add (toNatural 2) (mul (toNatural 3) (toNatural 5)))
17

``````

Folding rose trees

Let's try one more structure, namely rose trees:

``data TreeF a self = Node a [self]type Tree  a      = Fix (TreeF a)instance Functor (TreeF a) where  fmap f (Node v xs) = Node v (map f xs)``

As before, `fmap` applies a function to each `self`-typed field. Since we have a list of these, we use `map`.

We define some traversals:

``valuesF :: Tree a -> [a]valuesF = cata (\(Node x xs) -> x : concat xs)depthF :: Tree a -> IntegerdepthF = cata (\(Node _ xs) -> 1 + maximum (0:xs))``

and test them:

``````GHCi> let n x = In . Node x
GHCi> let t1 = n 'a' []
GHCi> valuesF t1
"a"
GHCi> depthF t1
1
GHCi> let t2 = n 'a' [n 'b' [], n 'c' [n 'd' []], n 'e' []]
GHCi> valuesF t2
"abcde"
GHCi> depthF t2
3

``````

The definition of `cata`

As promised:

``cata :: (Functor f) => (f a -> a) -> Fix f -> acata g = g . fmap (cata g) . out``

Generic programming

We usually manipulate data because we care about its meaning. But some operations only depend on the data's structure. If we have some complex nested data value, we might want to pretty-print it, or serialize it with binary, or collect all the `String`s buried within. These are concepts which apply to any data type you like. Shouldn't our code be similarly adaptable?

This idea is called generic programming. Implementations in Haskell include syb, uniplate, multiplate, instant-generics, GHC's built-in support, and many other efforts.

Each of these libraries demands certain information about each data type, and in return provides some generic operations. In this view, our above development of `Fix` constitutes a really tiny generics library. We require

• that you write recursive types as `Fix F`, where `F` is non-recursive, and
• that you provide `instance Functor F`.

In return, you get the single generic operation `cata`, which reduces a recursive structure to a "summary" value, by replacing each constructor with an arbitrary computation.

Our toy generics library is simple, but it doesn't actually save much work. Using `Fix` adds noise to the code, and our `Functor` instances aren't much shorter than a custom fold function for each type. Conversely, our uses of `cata` aren't much simpler than explicit recursive traversals.

Practicality aside, I think it's cool to define folding "once and for all". I'm interested to learn (from you!) what else can be done with `Fix`'d data or similar techniques.

For a non-trivial generics library built around similar ideas, see multirec and the accompanying paper.

Odds and ends

Nomenclature

`Fix` is sometimes named `Mu`, because the Greek letter μ is used to represent least fixed points.

`cata` stands for catamorphism.

Some of this stuff shows up in "Evolution of a Haskell Programmer". They build a factorial function using not only catamorphisms but also zygomorphisms and paramorphisms. Tell your friends!

In TaPL there's a discussion of equirecursive versus isorecursive type systems. Haskell is in the latter category. `Fix f` is isomorphic to, but not equivalent to, `f (Fix f)`. The functions `In` and `out` provide the isomorphism. If we defined `Fix` using `data` instead of `newtype`, the isomorphism would fail because of the extra value `In ⊥`.

Isomorphisms

`NaturalF` is isomorphic to `Maybe`:

``data NaturalF self = Zero    | Succ selfdata Maybe    a    = Nothing | Just a``

The `Functor` instances are also the same.

`NaturalF` is also isomorphic to `ListF ()`, ignoring undefined values.

GHC tricks

While the above is standard Haskell, we can pull a few more tricks by using a couple of GHC extensions.

GHC could help us reduce boilerplate by deriving `Functor` instances.

With some GHC extensions, we can write `Show` for `Fix`:

``instance (Show (f (Fix f))) => Show (Fix f) where  showsPrec p (In x) = showParen (p >= 11) (showString "In " . showsPrec 11 x)``

Testing it:

``````GHCi> toNatural 3
In (Succ (In (Succ (In (Succ (In Zero))))))
``````

This requires `-XFlexibleContexts` `-XUndecidableInstances`. The latter is required because `f (Fix f)` is a syntactically larger type than `Fix f`, so GHC can't convince itself that instance resolution will definitely terminate. For well-behaved arguments to `Fix`, it works out. We'll have something like

``-- from aboveinstance (Show (f (Fix f)))       => Show (Fix f)-- from 'deriving (Show)' on NaturalFinstance (Show self)       => Show (NaturalF self)``

which instantiates to

``instance (Show (NaturalF (Fix NaturalF)))       => Show (Fix NaturalF)instance (Show (Fix NaturalF))       => Show (NaturalF (Fix NaturalF))``

and GHC is clever enough to resolve the mutual recursion.

Remarkably, GHC can derive this instance with `-XStandaloneDeriving`:

``deriving instance (Show (f (Fix f))) => Show (Fix f)``

That's how I got the precedence-correct code above, via `-ddump-deriv`.

Thursday, November 4, 2010

Obtaining the name of a function in Haskell

Someone on `#haskell` asked whether you could recover a function's name from its value, i.e.

``````GHCi> name id
"id"
GHCi> name map
"map"
``````

This is easy in some languages. But Haskell is not designed to provide this kind of run-time information. We'll need some non-portable hacks. I tested this code with GHC 6.12.1 on `amd64` Linux; see below for portability notes.

How it works

Most values in GHC Haskell are represented by a closure. Closures representing functions are common in functional-language compilers, but GHC extends the idea to cover many other sorts of values.

Closures exist to store data. But the run-time system also needs operational information about each closure: how to garbage-collect it, how to force its evaluation, etc. This information is known at compile time and is shared between many closures. All algebraic values with the same constructor will share this information, as will all function values created from the same lambda in the program's source.

So each closure stores a pointer to an info table, holding this operational information. Info tables are generated at compile time, and stored as part of an executable's read-only data section. This means that they have statically-known addresses, with associated names in the executable's symbol table. We'll use these symbol names to name our functions.

We can dump an executable's symbol table with `nm`:

``````\$ nm -f posix foo
...
ghczmprim_GHCziBool_Bool_closure_tbl D 0000000000749978
ghczmprim_GHCziBool_False_closure D 0000000000749970
ghczmprim_GHCziBool_False_static_info T 00000000004e4dd8
ghczmprim_GHCziBool_True_closure D 0000000000749990
ghczmprim_GHCziBool_True_static_info T 00000000004e4d80
ghczmprim_GHCziDebug_debugErrLn1_closure D 00000000007499a0
ghczmprim_GHCziDebug_debugErrLn1_info T 00000000004e4ec0
...
``````

Haskell identifiers can contain characters not allowed in symbol names. GHC uses a name-mangling scheme to build symbol names. For example, the first symbol above decodes to

``````ghc-prim_GHC.Bool_Bool_closure_tbl

``````

We'll use vacuum to inspect heap objects. vacuum can do a lot of cool things, like visualize heap structures from a running Haskell program. We'll only use one of vacuum's functions here.

We'll also use the GHC API to un-mangle symbol names. GHC is a 20-year effort that has evolved alongside the Haskell language. It follows some legacy conventions like a mostly-flat module hierarchy. So the module we need is named simply `Encoding`.

``import Data.Wordimport Data.Maybeimport Text.Printfimport Control.Parallel ( pseq )import qualified Data.Map           as Mapimport qualified System.Posix.Files as Posiximport qualified System.Process     as Procimport qualified Foreign.Ptr        as Ptrimport qualified GHC.Vacuum         as Vacimport qualified Encoding           as GHC``

We invoke `nm` as a subprocess and parse its output:

``type Symbols = Map.Map Word StringgetSymbols :: IO SymbolsgetSymbols = do  exe <- Posix.readSymbolicLink "/proc/self/exe"  out <- Proc.readProcess "nm" ["-f", "posix", exe] ""  let offset = 0x10  let f (sym:_:addr:_) = Just (read ("0x"++addr) - offset, GHC.zDecodeString sym)      f _ = Nothing  return . Map.fromList . catMaybes . map (f . words) . lines \$ out``

We're using the Linux `proc` filesystem to get a symbolic link to our application's executable.

The symbols in memory appear at an address `0x10` = 16 bytes or 2 machine words lower than in the executable's symbol table. I'm not sure why; perhaps it's because of GHC's "tables next to code" optimization.

Resolving a symbol

Once we have the symbol table, looking up a value is relatively easy:

``name :: Symbols -> a -> Stringname syms x = fromMaybe unk \$ Map.lookup ptr syms where  ptr = x `pseq` (fromIntegral . Ptr.ptrToWordPtr . Vac.getInfoPtr \$ x)  unk = printf "<unknown info table at 0x%016x>" ptr``

We use vacuum to get the value's info table pointer, convert this to a `Word`, then look it up in the symbol table.

We explicitly evaluate `x` with `pseq`, to avoid seeing a thunk.

Testing it

We'll test with

``````\$ ghc --make name.hs -package ghc
\$ ./name
``````

Each test below is commented with the expected output. First, let's try a few non-function values:

``main :: IO ()main = do  syms <- getSymbols  let test = putStrLn . name syms  test 3            -- integer-gmp_GHC.Integer.Type_S#_con_info  test (3 :: Int)   -- ghc-prim_GHC.Types_I#_static_info  test "xyz"        -- ghc-prim_GHC.Types_:_con_info``

GHC defaults to `3 :: Integer`, as `-Wall` will tell you. As we see, `Integer` and `Int` are both implemented as algebraic data:

``data Integer  = S# Int#  | J# Int# ByteArray#data Int = I# Int#``

The string `"xyz"` is a list built out of `(:)` cells.

Next let's try a few functions:

``  test map          -- base_GHC.Base_map_info  test getChar      -- base_System.IO_getChar_info  test (+)          -- integer-gmp_GHC.Integer_plusInteger_info``

Not bad! `(+)` defaults to operating on `Integer`, and GHC inlines the type class dictionary, giving us the underlying `plusInteger` function.

Now let's see the limits of this technique:

``  test (\_ -> 'x')  -- s1jD_info  test (const 'x')  -- stg_PAP_info  test test         -- stg_PAP_info``

Our lambda expression gets a useless compiler-generated name. The application of `const` is worse; it uses an info table common to all partial applications. However, we could use vacuum to follow the fields of the `PAP` closure, which I'll leave as an exercise to the reader. ;)

`test` itself is also a partial application. It's defined by applying two arguments to the function `(.)` defined as

``(.) f g x = f (g x)``

If we eta-expand `test`:

``  let test x = putStrLn \$ name syms x``

then we'll get another compiler-generated name like `s1mh_info`.

Tuesday, October 26, 2010

Tour of a real toy Haskell program, part 2

This is part 2 of my commentary on detrospector. You can find part 1 here.

Performance claims

I make a lot of claims about performance below, not all of which are supported by evidence. I did perform a fair amount of profiling, but it's impossible to test every combination of data structures. How can I structure my code to make more of my performance hypotheses easily testable?

Representing the source document

When we process a source document, we'll need a data structure to represent large amounts of text.

• Haskell's standard `String` type is simple, maximally lazy, and Unicode-correct. However, as a singly-linked list of individually boxed heap-allocated characters, it's woefully inefficient in space and time.

• `ByteString` is a common alternative, but solves the wrong problem. We need a sequence of Unicode codepoints, not a sequence of bytes.

• `ByteString.Char8` is a hack and not remotely Unicode-correct.

• The text library stores Unicode characters in a `ByteString`-like packed format. That's exactly what we need.

We choose the lazy flavor of text so that we can stream the input file in chunks without storing it all in memory. Lazy IO is semantically dubious, but I've decided that this project is enough of a toy that I don't care. Note that robustness to IO errors is not in the requirements list. ;)

The enumerator package provides iteratees as a composable, well-founded alternative to the lazy IO hack. The enumerator API was complex enough to put me off using it for the first version of detrospector. After reading Michael Snoyman's enumerators tutorial, I have a slight idea how the library works, and I might use it in a future version.

Representing substrings

We also need to represent the k-character substrings, both when we analyze the source and when we generate text. The requirements here are different.

We expect that k will be small, certainly under 10 — otherwise, you'll just generate the source document! With many small strings, the penalty of boxing each individual `Char` is less severe.

And we need to support queue-like operations; we build a new substring by pushing a character onto the end, and dropping one from the beginning. For both `String` and `Text`, appending onto the end requires a full copy. So we'll use `Data.Sequence`, which provides sequences with bidirectional append, implemented as finger trees:

``-- module Detrospector.Typestype Queue a = S.Seq a``

...Actually, I just ran a quick test with `Text` as the queue type, and it seems not to matter much. Since k is small, the O(k) copy is insignificant. Profiling makes fools of us all, and asymptotic analysis is mostly misused. Anyway, I'm keeping the code the way it is, pending any insight from my clever readers. Perhaps one of the queues from Purely Functional Data Structures would be most appropriate.

Representing character counts

We need a data structure for tabulating character counts. In the old days of C and Eurocentrism, we could use a flat, mutable `int` indexed by ASCII values. But the range of Unicode characters is far too large for a flat array, and we need efficient functional updates without a full copy.

We could import `Data.Map` and use `Map Char Int`. This will build a balanced binary search tree using pairwise `Ord` comparisons.

But we can do better. We can use the bits of an integer key as a path in a tree, following a left or right child for a `0` or `1` bit, respectively. This sort of search tree (a trie) will typically outperform repeated pairwise comparisons. `Data.IntMap` implements this idea, with an API very close to `Map Int`. Our keys are `Char`s, but we can easily convert using `fromIntegral`.

``-- module Detrospector.Typesimport qualified Data.IntMap as IM...type FreqTable = IM.IntMap Int``

Representing distributions

So we have some frequency table like

``IM.fromList [('e', 267), ('t', 253), ('a', 219), ...]``

How can we efficiently pick a character from this distribution? We're mapping characters to individual counts, but we really want a map from cumulative counts to characters:

``-- module Detrospector.Typesdata PickTable = PickTable Int (IM.IntMap Char)``

To sample a character from `PickTable t im`, we first pick a random `k` such that 0 ≤ `k` < `t`, using a uniform distribution. We then find the first key in `im` which is greater than `k`, and take its associated `Char` value. In code:

``-- module Detrospector.Typesimport qualified System.Random.MWC as RNG...sample :: PickTable -> RNG.GenIO -> IO Charsample (PickTable t im) g = do  k <- (`mod` t) <\$> RNG.uniform g  case IM.split k im of    (_, IM.toList -> ((_,x):_)) -> return x    _ -> error "impossible"``

The largest cumulative sum is the total count `t`, so the largest key in `im` is `t`. We know `k` < `t`, so `IM.split k im` will never return an empty map on the right side.

Note the view pattern for pattern-matching an `IntMap` as if it were an ascending-order list.

The standard `System.Random` library in GHC Haskell is quite slow, a problem shared by most language implementations. We use the much faster mwc-random package. The only operation we need is picking a uniform `Int` as an `IO` action.

We still need a way to build our `PickTable`:

``-- module Detrospector.Typesimport Data.List ( mapAccumR )...cumulate :: FreqTable -> PickTablecumulate t = PickTable r \$ IM.fromList ps where  (r,ps) = mapAccumR f 0 \$ IM.assocs t  f ra (x,n) = let rb = ra+n in (rb, (rb, toEnum x))``

This code is short, but kind of tricky. For reference:

``mapAccumR :: (acc -> x -> (acc, y)) -> acc -> [x] -> (acc, [y])f :: Int -> (Int, Int) -> (Int, (Int, Char))``

`f` takes an assoc pair from the `FreqTable`, adds its count to the running sum, and produces an assoc pair for the `PickTable`. We start the traversal with a sum of `0`, and get the final sum `r` along with our assoc pairs `ps`.

Representing the Markov chain

So we can represent probability distributions for characters. Now we need a map from k-character substrings to distributions.

`Data.Map` is again an option, but pairwise, character-wise comparison of our `Queue Char` values will be slow. What we really want is another trie, with character-based fanout at each node. Hackage has bytestring-trie, which unfortunately works on bytes, not characters. And maybe I should have used TrieMap or list-tries. Instead I used the hashmap package:

``-- module Detrospector.Typesimport qualified Data.HashMap as H...data Chain = Chain Int (H.HashMap (Queue Char) PickTable)``

A value `Chain k hm` maps from subsequences of up to `k` `Char`s to `PickTable`s. A lookup of some `Queue Char` key will require one traversal to calculate an `Int` hash value, then uses an `IntMap` to find a (hopefully small) `Map` of keys with that same hash value.

There is a wrinkle: we need to specify how to hash a `Queue`, which is just a synonym for `S.Seq`. This is an orphan instance, which we could avoid by `newtype`-wrapping `S.Seq`.

``-- module Detrospector.Typesimport qualified Data.Hashable as Himport qualified Data.Foldable as F...instance (H.Hashable a) => H.Hashable (S.Seq a) where  {-# SPECIALIZE instance H.Hashable (S.Seq Char) #-}  hash = F.foldl' (\acc h -> acc `H.combine` H.hash h) 0``

This code is lifted almost verbatim from the `[a]` instance in `Data.Hashable`.

Serialization

After training, we need to write the `Chain` to disk, for use in subsequent generation. I started out with derived `Show` and `Read`, which was simple but incredibly slow. We'll use binary with `ByteString.Lazy` — the dreaded lazy IO again!

We start by specifying how to serialize a few types. Here the tuple instances for `Binary` come in handy:

``-- module Detrospector.Typesimport qualified Data.Binary as Bin...-- another orphan instanceinstance (Bin.Binary k, Bin.Binary v, H.Hashable k, Ord k)       => Bin.Binary (H.HashMap k v) where  put = Bin.put . H.assocs  get = H.fromList <\$> Bin.getinstance Bin.Binary PickTable where  put (PickTable n t) = Bin.put (n,t)  get = uncurry PickTable <\$> Bin.getinstance Bin.Binary Chain where  put (Chain n h) = Bin.put (n,h)  get = uncurry Chain <\$> Bin.get``

The actual IO is easy. We use gzip compression, which fits right into the IO pipeline:

``-- module Detrospector.Typesimport qualified Data.ByteString.Lazy   as BSLimport qualified Codec.Compression.GZip as Z...withChain :: FilePath -> (Chain -> RNG -> IO a) -> IO awithChain p f = do  ch <- (Bin.decode . Z.decompress) <\$> BSL.readFile p  RNG.withSystemRandom \$ f chwriteChain :: FilePath -> Chain -> IO ()writeChain out = BSL.writeFile out . Z.compress . Bin.encode``

Training the chain

I won't present the whole implementation of the `train` subcommand, but here's a simplification:

``-- module Detrospector.Modes.Trainimport qualified Data.Text     as Txtimport qualified Data.Text.IO  as Txtimport qualified Data.HashMap  as Himport qualified Data.IntMap   as IMimport qualified Data.Sequence as Simport qualified Data.Foldable as F...train Train{num,out} = do  (_,h) <- Txt.foldl' roll (emptyQ, H.empty) ys <\$> Txt.getContents  writeChain out . Chain num \$ H.map cumulate h where  roll (!s,!h) x    = (shift num x s, F.foldr (H.alter \$ ins x) h \$ S.tails s)   ins x Nothing  = Just \$! sing x  ins x (Just v) = Just \$! incr x v   sing x = IM.singleton (fromEnum x) 1   incr x = IM.alter f \$ fromEnum x where    f Nothing  = Just 1    f (Just v) = Just \$! (v+1)``

Before generating `PickTable`s, we build a `HashMap` of `FreqTable`s. We fold over the input text, accumulating a pair of (last characters seen, map so far). Since `foldl'` is only strict to weak head-normal form (WHNF), we use bang patterns on the fold function `roll` to force further evaluation. RWH discusses the same issue.

`shift` (from `Detrospector.Types`) pushes a new character into the queue, and drops the oldest character if the size exceeds `num`. We add one count for the new character `x`, both for the whole history `s` and each of its suffixes.

We're using `alter` where perhaps a combination of `lookup` and `insert` would be more natural. This is a workaround for a subtle laziness-related space leak, which I found after much profiling and random mucking about. When you insert into a map like so:

``let mm = insert k (v+1) m``

there is nothing to force `v+1` to WHNF, even if you force `mm` to WHNF. The leaves of our tree end up holding large thunks of the form `((((1+1)+1)+1)+...)`.

The workaround is to call `alter` with `Just \$! (v+1)`. We know that the implementation of `alter` will pattern-match on the `Maybe` constructor, which then triggers WHNF evaluation of `v+1` because of `(\$!)`. This was tricky to figure out. Is there a better solution, or some different way I should approach this problem? It seems to me that `Data.Map` and friends are generally lacking in strictness building blocks.

The end!

Thanks for reading! Here's an example of how not to write the same program:

``module Main where{import Random;(0:y)%(R p _)=y%p;(1:f)%(R _ n)=f%n;[]%(J x)=x;b[p,v,k]=(l k%).(l v%).(l p%);main=getContents>>=("eek"#).flip(.:"eek")(y.y.y.y\$0);h k=toEnum k;;(r:j).:[k,v,b]=(j.:[v,b,r]).((k?).(v?).(b?)\$(r?)m);[].:_=id;(!)=iterate;data R a=J a|R(R a)(R a);(&)i=fmap i;k b y v j=let{t=l b%v+y;d f|t>j=b;df=k(m b)t v j}in d q;y l=(!!8)(q R!J l);q(+)b=b+b;p(0:v)z(R f x)=R(p v z f)x;p[]z(J x)=J(z x);p(1:v)z(R n k)=R n\$p v z k;m = succ;l p=tail.(snd&).take 9\$((<<2).fst)!(fromEnum p,0);(?)=p.l;d@[q,p,r]#v=let{u=b d v;j=i u;(s,o)|j<1=((97,122),id)|h 1=((0,j-1),(k 0 0 u))}in do{q<-(h.o)&randomRIO s;putChar q;[p,r,q]#v};i(J q)=q;i(R n k)=i n+i k;(<<)=divMod} -- finite text on stdin  © keegan oct 2010 BSD3``

Tour of a real toy Haskell program, part 1

Haskell suffers from a problem I'll call the Fibonacci Gap. Many beginners start out with a bunch of small mathematical exercises, develop that skillset, and then are at a loss for what to study next. Often they'll ask in `#haskell` for an example of a "real Haskell program" to study. Typical responses include the examples in Real World Haskell, or the Design and Implementation of XMonad talk.

This is my attempt to provide another data point: a commentary on detrospector, my text-generating toy. It's perhaps between the RWH examples and xmonad in terms of size and complexity. It's not a large program, and certainly not very useful, but it does involve a variety of real-world concerns such as Unicode processing, choice of data structures, strictness for performance, command-line arguments, serialization, etc. It also illustrates my particular coding style, which I don't claim is the One True Way, but which has served me well in a variety of Haskell projects over the past five years.

Of course, I imagine that experts may disagree with some of my decisions, and I welcome any and all feedback.

I haven't put hyperlinked source online yet, but you can grab the tarball and follow along.

This is part 1, covering style and high-level design. Part 2 addresses more details of algorithms, data structures, performance, etc.

The algorithm

detrospector generates random text conforming to the general style and diction of a given source document, using a quite common algorithm.

First, pick a "window size" k. We consider all k-character contiguous substrings of the source document. For each substring w, we want to know the probability distribution of the next character. In other words, we want to compute values for

P(next char is x | last k chars were w).

To compute these, we scan through the source document, remembering the last k characters as w. We build a table of next-character occurrence counts for each substring w.

These tables form a Markov chain, and we can generate random text by a random walk in this chain. If the last k characters we generated were w, we choose the next character randomly according to the observed distribution for w.

So we can train it on Accelerando and get output like:

addressed to tell using back holes everyone third of the glances waves and diverging him into the habitat. Far beyond Neptune, I be?" asks, grimy. Who's whether is headquarters I need a Frenchwoman's boyfriend go place surrected in the whole political-looking on the room, leaving it, beam, the wonderstood. The Chromosome of upload, does enough. If this one of the Vile catches agree."

Design requirements

We can only understand a program's design in light of the problem it was meant to solve. Here's my informal list of requirements:

• detrospector should generate random text according to the above algorithm.

• We should be able to invoke the text generator many times without re-analyzing the source text each time.

• detrospector should handle all Unicode characters, using the character encoding specified by the system's locale.

• detrospector should be fast without sacrificing clarity, whatever that means.

Wearing my customer hat, these are axioms without justification. Wearing my implementor hat, I will have to justify design decisions in these terms.

Style

In general I import modules `qualified`, using `as` to provide a short local name. I make an exception for other modules in the same project, and for "standard modules". I don't have a precise definition of "standard module" but it includes things like `Data.Maybe`, `Control.Applicative`, etc.

The longest line in detrospector is 70 characters. There is no hard limit, but more than about 90 is suspect.

Indentation is two spaces. Tabs are absolutely forbidden. I don't let the indentation of a block construct depend on the length of a name, thus:

``foo x = do  y <- bar x  baz y``

and

``bar x =  let y = baz x      z = quux x y  in  y z``

This avoids absurd left margins, looks more uniform, and is easier to edit.

I usually write delimited syntax with one item per line, with the delimiter prefixed:

``{-# LANGUAGE    ViewPatterns  , PatternGuards #-}``

and

``data Mode  = Train { num   :: Int          , out   :: FilePath }  | Run   { chain :: FilePath }``

Overriding layout is sometimes useful, e.g.:

``look = do x <- H.lookup t h; return (x, S.length t)``

(With `-XTupleSections` we could write

``look = (,S.length t) <\$> H.lookup t h``

but that's just gratuitous.)

I always write type signatures on top-level bindings, but rarely elsewhere.

Module structure

I started out with a single file, which quickly became unmanageable. The current module layout is:

``````Detrospector.
Types      types and functions used throughout
Modes      a type with one constructor per mode
Modes.
Train    train the Markov chain
Run      generate random text
Neolog   generate neologisms
Main       command-line parsing
``````

There is also a `Main` module in `detrospector.hs`, which simply invokes `Detrospector.Main.main`.

Modules I write tend to fall into two categories: those which export nearly everything, and those which export only one or two things. The former includes "utility" modules with lots of small types and function definitions. The latter includes modules providing a specific piece of functionality. A parsing module might define three dozen parsers internally, but will only export the root of the grammar.

An abstract data type might fall into a third category, since they can export a large API yet have lots of internal helpers. But I don't write such modules very often.

`Detrospector.Types` is in the first category. Most Haskell projects will have a `Types` module, although I'm somewhat disappointed that I let this one become a general grab-bag of types and utility functions.

The rest fall into the second category. Each module in `Detrospector.Modes.*` exports one function to handle that mode. `Detrospector.Main` exports only `main`.

Build setup

This was actually my first Cabal project, and the first thing I uploaded to Hackage. I think Cabal is great, and features like package-visibility management are useful even for small local projects.

In my `cabal` file I set `ghc-options: -Wall`, which enables many helpful warnings. The project should build with no warnings, but I use the `OPTIONS_GHC` pragma to disable specific warnings in specific files, where necessary.

I also run HLint on my code periodically, but I don't have it integrated with Cabal.

I was originally passing `-O2` to `ghc`. Cabal complained that it's probably not necessary, which was correct. The Cabal default of `-O` performs just as well.

I'm using Git for source control, which is neither here nor there.

Command-line parsing

detrospector currently has three modes, as listed above. I wanted to use the "subcommand" model of `git`, `cabal`, etc. So we have `detrospector train`, `detrospector run`, etc. The cmdargs package handles this argument style with a low level of boilerplate.

The "impure" interface to cmdargs uses some dark magic in the operator `&=` in order to attach annotations to arbitrary record fields. The various caveats made me uneasy, so I opted for the slightly more verbose "pure" interface, which looks like this:

``-- module Detrospector.Mainimport qualified System.Console.CmdArgs as Argimport           System.Console.CmdArgs((+=),Annotate((:=)))...modes  = Arg.modes_  [train,run,neolog]      += Arg.program "detrospector"      += Arg.summary "detrospector: Markov chain text generator"      += Arg.help    "Build and run Markov chains for text generation"  where  train = Arg.record Train{}    [ num := 4          += Arg.help "Number of characters lookback"    , out := error "Must specify output chain"          += Arg.typFile          += Arg.help "Write chain to this file" ]    += Arg.help "Train a Markov chain from standard input"  run = Arg.record Run{}    [ chain := error "Must specify input chain"            += Arg.argPos 0                     += Arg.typ "CHAIN_FILE" ]    += Arg.help "Generate random text"  ...``

This tells cmdargs how to construct values of my record type `Detrospector.Modes.Mode`. We get help output for free:

``````\$ ./dist/build/detrospector/detrospector -?
detrospector: Markov chain text generator

detrospector [COMMAND] ... [OPTIONS]
Build and run Markov chains for text generation

Common flags:
-? --help        Display help message
-V --version     Print version information

detrospector train [OPTIONS]
Train a Markov chain from standard input

-n --num=INT     Number of characters lookback
-o --out=FILE    Write chain to this file

detrospector run [OPTIONS] CHAIN_FILE
Generate random text

...
``````

My use of `error` here is hacky and leads to a bug that I recently discovered. When the `-o` argument to `train` is invalid or missing, the error is not printed until the (potentially time-consuming) analysis is completed. Only then is the record's field forced.

To be continued...

...right this way.