Thursday, November 4, 2010

Obtaining the name of a function in Haskell

Someone on #haskell asked whether you could recover a function's name from its value, i.e.

GHCi> name id
"id"
GHCi> name map
"map"

This is easy in some languages. But Haskell is not designed to provide this kind of run-time information. We'll need some non-portable hacks. I tested this code with GHC 6.12.1 on amd64 Linux; see below for portability notes.

How it works

Most values in GHC Haskell are represented by a closure. Closures representing functions are common in functional-language compilers, but GHC extends the idea to cover many other sorts of values.

Closures exist to store data. But the run-time system also needs operational information about each closure: how to garbage-collect it, how to force its evaluation, etc. This information is known at compile time and is shared between many closures. All algebraic values with the same constructor will share this information, as will all function values created from the same lambda in the program's source.

So each closure stores a pointer to an info table, holding this operational information. Info tables are generated at compile time, and stored as part of an executable's read-only data section. This means that they have statically-known addresses, with associated names in the executable's symbol table. We'll use these symbol names to name our functions.

We can dump an executable's symbol table with nm:

$ nm -f posix foo
...
ghczmprim_GHCziBool_Bool_closure_tbl D 0000000000749978 
ghczmprim_GHCziBool_False_closure D 0000000000749970 
ghczmprim_GHCziBool_False_static_info T 00000000004e4dd8 
ghczmprim_GHCziBool_True_closure D 0000000000749990 
ghczmprim_GHCziBool_True_static_info T 00000000004e4d80 
ghczmprim_GHCziDebug_debugErrLn1_closure D 00000000007499a0 
ghczmprim_GHCziDebug_debugErrLn1_info T 00000000004e4ec0 
...

Haskell identifiers can contain characters not allowed in symbol names. GHC uses a name-mangling scheme to build symbol names. For example, the first symbol above decodes to

ghc-prim_GHC.Bool_Bool_closure_tbl

Reading the symbols

We'll use vacuum to inspect heap objects. vacuum can do a lot of cool things, like visualize heap structures from a running Haskell program. We'll only use one of vacuum's functions here.

We'll also use the GHC API to un-mangle symbol names. GHC is a 20-year effort that has evolved alongside the Haskell language. It follows some legacy conventions like a mostly-flat module hierarchy. So the module we need is named simply Encoding.

import Data.Word
import Data.Maybe
import Text.Printf
import Control.Parallel ( pseq )

import qualified Data.Map as Map
import qualified System.Posix.Files as Posix
import qualified System.Process as Proc
import qualified Foreign.Ptr as Ptr

import qualified GHC.Vacuum as Vac
import qualified Encoding as GHC

We invoke nm as a subprocess and parse its output:

type Symbols = Map.Map Word String

getSymbols :: IO Symbols
getSymbols = do
exe <- Posix.readSymbolicLink "/proc/self/exe"
out <- Proc.readProcess "nm" ["-f", "posix", exe] ""
let offset = 0x10
let f (sym:_:addr:_) = Just (read ("0x"++addr) - offset, GHC.zDecodeString sym)
f _ = Nothing
return . Map.fromList . catMaybes . map (f . words) . lines $ out

We're using the Linux proc filesystem to get a symbolic link to our application's executable.

The symbols in memory appear at an address 0x10 = 16 bytes or 2 machine words lower than in the executable's symbol table. I'm not sure why; perhaps it's because of GHC's "tables next to code" optimization.

Resolving a symbol

Once we have the symbol table, looking up a value is relatively easy:

name :: Symbols -> a -> String
name syms x = fromMaybe unk $ Map.lookup ptr syms where
ptr = x `pseq` (fromIntegral . Ptr.ptrToWordPtr . Vac.getInfoPtr $ x)
unk = printf "<unknown info table at 0x%016x>" ptr

We use vacuum to get the value's info table pointer, convert this to a Word, then look it up in the symbol table.

We explicitly evaluate x with pseq, to avoid seeing a thunk.

Testing it

We'll test with

$ ghc --make name.hs -package ghc
$ ./name

Each test below is commented with the expected output. First, let's try a few non-function values:

main :: IO ()
main = do
syms <- getSymbols
let test = putStrLn . name syms
test 3 -- integer-gmp_GHC.Integer.Type_S#_con_info
test (3 :: Int) -- ghc-prim_GHC.Types_I#_static_info
test "xyz" -- ghc-prim_GHC.Types_:_con_info

GHC defaults to 3 :: Integer, as -Wall will tell you. As we see, Integer and Int are both implemented as algebraic data:

data Integer
= S# Int#
| J# Int# ByteArray#

data Int = I# Int#

The string "xyz" is a list built out of (:) cells.

Next let's try a few functions:

  test map          -- base_GHC.Base_map_info
test getChar -- base_System.IO_getChar_info
test (+) -- integer-gmp_GHC.Integer_plusInteger_info

Not bad! (+) defaults to operating on Integer, and GHC inlines the type class dictionary, giving us the underlying plusInteger function.

Now let's see the limits of this technique:

  test (\_ -> 'x')  -- s1jD_info
test (const 'x') -- stg_PAP_info
test test -- stg_PAP_info

Our lambda expression gets a useless compiler-generated name. The application of const is worse; it uses an info table common to all partial applications. However, we could use vacuum to follow the fields of the PAP closure, which I'll leave as an exercise to the reader. ;)

test itself is also a partial application. It's defined by applying two arguments to the function (.) defined as

(.) f g x = f (g x)

If we eta-expand test:

  let test x = putStrLn $ name syms x

then we'll get another compiler-generated name like s1mh_info.

Notes

This is a hack, and probably not suitable for any serious purpose. Shelling out to nm to get a symbol table is particularly ugly. I tried to use bindings to BFD, but ran into some segfaults.

The above code will work only on 64-bit machines, but could be adapted for 32-bit. I bet the magic offset would change. It works on GHC 6.12.1, and should work on other recent versions, if you can get vacuum to build.

It definitely requires a Unix system, and specifically Linux or something emulating Linux's proc filesystem. You'll need nm from GNU Binutils, which is standard on a system configured for C development.

It won't work if you run strip on your binaries...