Tuesday, October 26, 2010

Tour of a real toy Haskell program, part 1

Haskell suffers from a problem I'll call the Fibonacci Gap. Many beginners start out with a bunch of small mathematical exercises, develop that skillset, and then are at a loss for what to study next. Often they'll ask in #haskell for an example of a "real Haskell program" to study. Typical responses include the examples in Real World Haskell, or the Design and Implementation of XMonad talk.

This is my attempt to provide another data point: a commentary on detrospector, my text-generating toy. It's perhaps between the RWH examples and xmonad in terms of size and complexity. It's not a large program, and certainly not very useful, but it does involve a variety of real-world concerns such as Unicode processing, choice of data structures, strictness for performance, command-line arguments, serialization, etc. It also illustrates my particular coding style, which I don't claim is the One True Way, but which has served me well in a variety of Haskell projects over the past five years.

Of course, I imagine that experts may disagree with some of my decisions, and I welcome any and all feedback.

I haven't put hyperlinked source online yet, but you can grab the tarball and follow along.

This is part 1, covering style and high-level design. Part 2 addresses more details of algorithms, data structures, performance, etc.

The algorithm

detrospector generates random text conforming to the general style and diction of a given source document, using a quite common algorithm.

First, pick a "window size" k. We consider all k-character contiguous substrings of the source document. For each substring w, we want to know the probability distribution of the next character. In other words, we want to compute values for

P(next char is x | last k chars were w).

To compute these, we scan through the source document, remembering the last k characters as w. We build a table of next-character occurrence counts for each substring w.

These tables form a Markov chain, and we can generate random text by a random walk in this chain. If the last k characters we generated were w, we choose the next character randomly according to the observed distribution for w.

So we can train it on Accelerando and get output like:

addressed to tell using back holes everyone third of the glances waves and diverging him into the habitat. Far beyond Neptune, I be?" asks, grimy. Who's whether is headquarters I need a Frenchwoman's boyfriend go place surrected in the whole political-looking on the room, leaving it, beam, the wonderstood. The Chromosome of upload, does enough. If this one of the Vile catches agree."

Design requirements

We can only understand a program's design in light of the problem it was meant to solve. Here's my informal list of requirements:

  • detrospector should generate random text according to the above algorithm.

  • We should be able to invoke the text generator many times without re-analyzing the source text each time.

  • detrospector should handle all Unicode characters, using the character encoding specified by the system's locale.

  • detrospector should be fast without sacrificing clarity, whatever that means.

Wearing my customer hat, these are axioms without justification. Wearing my implementor hat, I will have to justify design decisions in these terms.

Style

In general I import modules qualified, using as to provide a short local name. I make an exception for other modules in the same project, and for "standard modules". I don't have a precise definition of "standard module" but it includes things like Data.Maybe, Control.Applicative, etc.

The longest line in detrospector is 70 characters. There is no hard limit, but more than about 90 is suspect.

Indentation is two spaces. Tabs are absolutely forbidden. I don't let the indentation of a block construct depend on the length of a name, thus:

foo x = do
y <- bar x
baz y

and

bar x =
let y = baz x
z = quux x y
in y z

This avoids absurd left margins, looks more uniform, and is easier to edit.

I usually write delimited syntax with one item per line, with the delimiter prefixed:

{-# LANGUAGE
ViewPatterns
, PatternGuards #-}

and

data Mode
= Train { num :: Int
, out :: FilePath }
| Run { chain :: FilePath }

Overriding layout is sometimes useful, e.g.:

look = do x <- H.lookup t h; return (x, S.length t)

(With -XTupleSections we could write

look = (,S.length t) <$> H.lookup t h

but that's just gratuitous.)

I always write type signatures on top-level bindings, but rarely elsewhere.

Module structure

I started out with a single file, which quickly became unmanageable. The current module layout is:

Detrospector.
  Types      types and functions used throughout
  Modes      a type with one constructor per mode
  Modes.
    Train    train the Markov chain
    Run      generate random text
    Neolog   generate neologisms
  Main       command-line parsing

There is also a Main module in detrospector.hs, which simply invokes Detrospector.Main.main.

Modules I write tend to fall into two categories: those which export nearly everything, and those which export only one or two things. The former includes "utility" modules with lots of small types and function definitions. The latter includes modules providing a specific piece of functionality. A parsing module might define three dozen parsers internally, but will only export the root of the grammar.

An abstract data type might fall into a third category, since they can export a large API yet have lots of internal helpers. But I don't write such modules very often.

Detrospector.Types is in the first category. Most Haskell projects will have a Types module, although I'm somewhat disappointed that I let this one become a general grab-bag of types and utility functions.

The rest fall into the second category. Each module in Detrospector.Modes.* exports one function to handle that mode. Detrospector.Main exports only main.

Build setup

This was actually my first Cabal project, and the first thing I uploaded to Hackage. I think Cabal is great, and features like package-visibility management are useful even for small local projects.

In my cabal file I set ghc-options: -Wall, which enables many helpful warnings. The project should build with no warnings, but I use the OPTIONS_GHC pragma to disable specific warnings in specific files, where necessary.

I also run HLint on my code periodically, but I don't have it integrated with Cabal.

I was originally passing -O2 to ghc. Cabal complained that it's probably not necessary, which was correct. The Cabal default of -O performs just as well.

I'm using Git for source control, which is neither here nor there.

Command-line parsing

detrospector currently has three modes, as listed above. I wanted to use the "subcommand" model of git, cabal, etc. So we have detrospector train, detrospector run, etc. The cmdargs package handles this argument style with a low level of boilerplate.

The "impure" interface to cmdargs uses some dark magic in the operator &= in order to attach annotations to arbitrary record fields. The various caveats made me uneasy, so I opted for the slightly more verbose "pure" interface, which looks like this:

-- module Detrospector.Main
import qualified System.Console.CmdArgs as Arg
import System.Console.CmdArgs((+=),Annotate((:=)))
...
modes = Arg.modes_ [train,run,neolog]
+= Arg.program "detrospector"
+= Arg.summary "detrospector: Markov chain text generator"
+= Arg.help "Build and run Markov chains for text generation"
where

train = Arg.record Train{}
[ num := 4
+= Arg.help "Number of characters lookback"
, out := error "Must specify output chain"
+= Arg.typFile
+= Arg.help "Write chain to this file" ]
+= Arg.help "Train a Markov chain from standard input"

run = Arg.record Run{}
[ chain := error "Must specify input chain"
+= Arg.argPos 0
+= Arg.typ "CHAIN_FILE" ]
+= Arg.help "Generate random text"
...

This tells cmdargs how to construct values of my record type Detrospector.Modes.Mode. We get help output for free:

$ ./dist/build/detrospector/detrospector -?
detrospector: Markov chain text generator

detrospector [COMMAND] ... [OPTIONS]
  Build and run Markov chains for text generation

Common flags:
  -? --help        Display help message
  -V --version     Print version information

detrospector train [OPTIONS]
  Train a Markov chain from standard input

  -n --num=INT     Number of characters lookback
  -o --out=FILE    Write chain to this file

detrospector run [OPTIONS] CHAIN_FILE
  Generate random text

...

My use of error here is hacky and leads to a bug that I recently discovered. When the -o argument to train is invalid or missing, the error is not printed until the (potentially time-consuming) analysis is completed. Only then is the record's field forced.

To be continued...

...right this way.

No comments:

Post a Comment