The first serious projects I coded in Haskell were compilers, code analyzers, and the like. I think it's a domain that really plays to the strengths of the language. So it makes sense that we'd want a top-notch disassembler library for Haskell.
udis86
is a fast, complete, flexible disassembler for x86 and x86-64 / AMD64. It provides a clean C API, which was no trouble to import using hsc2hs
. But any C API is going to feel clunky next to high-level idiomatic Haskell code. So I built two additional layers of wrapping, to make something that will feel like a natural part of your next Haskell app.
You can download my library hdis86
from Hackage, or browse the source at GitHub. By default, a copy of udis86
is embedded in hdis86
. If you already have udis86
installed as a shared library, you can link against that instead.
Getting started
Let's try it out. We'll feed in a ByteString
of machine code, and pretty-print the result with groom
.
$ ghci
λ> :m + Hdis86 Data.ByteString Text.Groom
λ> let code = pack [0xcc, 0xf0, 0xff, 0x44, 0x9e, 0x0f]
λ> Prelude.putStrLn . groom $ disassemble intel64 code
[Inst [] Iint3 [],
Inst [Lock] Iinc
[Mem
(Memory{mSize = Bits32, mBase = Reg64 RSI, mIndex = Reg64 RBX,
mScale = 4, mOffset = Immediate{iSize = Bits8, iValue = 15}})]]
If you're not familiar with x86, you might be surprised by the range of simple and complicated instructions. The first instruction is a humble int3
, which executes a trap to a debugger. It takes no operands and has no prefixes.
The second instruction is an increment of a 32-bit memory region, whose address is computed by summing the value in register RSI
, the value in register RBX
times a scale factor of 4, and a fixed offset of 15. This instruction also has a lock prefix, which changes its concurrency semantics. Some prefixes have a direct meaning like this; others (like Rex
) will change how the disassembler decodes operands.
We can also ask for Intel-style assembly syntax:
λ> let cfg = intel64 { cfgSyntax = SyntaxIntel }
λ> mapM_ (Prelude.putStrLn . mdAssembly) $ disassembleMetadata cfg code
int3
lock inc dword [rsi+rbx*4+0xf]
Analyzing instructions
If we're just printing code, show
ing an Instruction
is needlessly verbose. The real point of the Instruction
type is machine-code analysis, with the help of Haskell's pattern matching capabilities.
Sometimes two fragments of code will differ only in which registers they use. Perhaps two compilers made different choices during register allocation. We'll write a program to detect this.
import Hdis86
import Text.Groom
import Control.Monad
import qualified Data.Map as M
import qualified Data.ByteString as B
When one code fragment uses register X the same way that another fragment uses register Y, we record a constraint X :-> Y
:
data Constraint = Register :-> Register
We compare the two code fragments to produce a list of Constraint
s, or Nothing
if they differ in ways beyond register selection. We'll use this helper function to check a list of Boolean conditions:
(==>) :: [Bool] -> a -> Maybe a
ps ==> x = guard (and ps) >> Just x
We compare operands pairwise. Register operands are easy:
operand :: Operand -> Operand -> Maybe [Constraint]
operand (Reg rx) (Reg ry) = Just [rx :-> ry]
For a memory operand, we need to check that the size, scale, and offset are equal:
-- size, base register, index register, scale, offset
operand (Mem (Memory sx bx ix kx ox)) (Mem (Memory sy by iy ky oy))
= [sx == sy, kx == ky, ox == oy] ==> [bx :-> by, ix :-> iy]
Immediate operands need a similar check for equality, but they don't produce any register constraints:
operand (Ptr px) (Ptr py) = [px == py] ==> []
operand (Imm ix) (Imm iy) = [ix == ix] ==> []
operand (Jump ix) (Jump iy) = [ix == iy] ==> []
operand (Const ix) (Const iy) = [ix == iy] ==> []
In all other cases, we have two different operand constructors, which means they don't match:
operand _ _ = Nothing
To check a pair of instructions, we check their prefixes and opcodes, then check operands pairwise:
inst :: Instruction -> Instruction -> Maybe [Constraint]
inst (Inst px ox rx) (Inst py oy ry) = do
guard $ and [px == py, ox == oy, length rx == length ry]
concat `fmap` zipWithM operand rx ry
Once we have a list of Constraint
s, we need to check it for consistency. If register X maps to Y in one place, it can't map to Z somewhere else:
unify :: [Constraint] -> Maybe (M.Map Register Register)
unify = foldM f M.empty where
f m (rx :-> ry) = case M.lookup rx m of
Nothing -> Just (M.insert rx ry m)
Just ry' -> [ry == ry'] ==> m
Now we put it all together, checking consistency in both directions:
regMap :: [Instruction] -> [Instruction] -> Maybe (M.Map Register Register)
regMap xs ys = do
cs <- concat `fmap` zipWithM inst xs ys
let swap (x :-> y) = (y :-> x)
_ <- unify $ map swap cs
unify cs
main :: IO ()
main = putStrLn . groom $ regMap (f prog_a) (f prog_b) where
f = disassemble intel64
We'll test these two code fragments:
prog_a, prog_b :: B.ByteString
prog_a = B.pack
[ 0x7e, 0x3a -- jle 0x3c
, 0x48, 0x89, 0xf5 -- mov rbp, rsi
, 0xbb, 1, 0, 0, 0 -- mov ebx, 0x1
, 0x48, 0x8b, 0x7d, 0x08 ] -- mov rdi, [rbp+0x8]
prog_b = B.pack
[ 0x7e, 0x3a -- jle 0x3c
, 0x48, 0x89, 0xf3 -- mov rbx, rsi
, 0xbd, 1, 0, 0, 0 -- mov ebp, 0x1
, 0x48, 0x8b, 0x7b, 0x08 ] -- mov rdi, [rbx+0x8]
And the result is:
$ runhaskell regmap.hs
Just
(fromList
[(RegNone, RegNone), (Reg32 RBX, Reg32 RBP),
(Reg64 RBP, Reg64 RBX), (Reg64 RSI, Reg64 RSI),
(Reg64 RDI, Reg64 RDI)])
In reality, this analysis is grossly incomplete. Some instructions have implicit register operands, and some pairs of registers overlap, like EBX
and RBX
. And we have to understand contextual requirements. It's not okay to remap RAX
to RBX
if some calling code is expecting a return value in RAX
.
The complexity of x86 (not to mention the halting problem) means that binary code analysis will never be easy. Hopefully hdis86
can be one of many useful tools in this domain. As always, suggestions or patches are welcome.