Monday, October 10, 2011

shqq: Embedding shell commands in Haskell code

Shell scripts make it easy to pass data between external commands. But as a programming language, shell script lacks features like non-trivial data structures and easy, robust concurrency, which would be useful in building quick solutions to system administration and automation problems.

As others have noted, Haskell is an interesting alternative for these scripting tasks. I wrote the shqq library to make it a little easier to invoke external programs from Haskell. With the sh quasiquoter, you write a shell command which embeds Haskell variables, execute it as an IO action, and get the command's standard output as a String. In other words, it's a bit like the backtick operator from Perl or Ruby.

Here's a small example:

$ ghci -XQuasiQuotes
λ> import System.ShQQ
λ> let x = "/proc/uptime" in [sh| sha1sum $x |]
"337ec3fb998fb3a4650a18e0785f0992762b3cda  /proc/uptime\n"

shqq also handles escaping for you, so that the shell will not interpret special characters from Haskell variables. You can override this behavior when desired.
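For instance, the escaped substitution below passes the whole string as a single literal argument, so the shell does not treat the semicolon as a command separator:

λ> let f = "; echo gotcha" in [sh| echo $f |]
"; echo gotcha\n"

To substitute without escaping, use the $+x form (per the package documentation), which lets the shell perform word splitting and interpret any metacharacters in the value.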

The big caveat is that shqq refuses to build on GHC 7.0 or earlier, due to a Unicode-handling bug in process-1.0. You'll need GHC 7.2 or later.

Finding duplicate files

As an example, here's a program to find duplicate files in the directory tree rooted at the current working directory.

{-# LANGUAGE QuasiQuotes, TupleSections #-}
import Control.Concurrent.Spawn -- package spawn-0.3
import Data.List.Split -- package split-0.1
import System.ShQQ -- package shqq-0.1
import System.Posix.Files
import qualified Data.Map as M

First, the computation itself. If we pair each file with a key, such as size or checksum, we can find the groups of (potentially) duplicate files.

dupes :: (Ord k) => [(FilePath,k)] -> [[FilePath]]
dupes = filter (not . null . drop 1) . M.elems
      . foldr (\(v,k) -> M.insertWith (++) k [v]) M.empty
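For example, pairing each name with a toy key in GHCi:

λ> dupes [("a",1), ("b",2), ("c",1)]
[["a","c"]]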

We want to examine files in parallel, but the operating system will complain if we have too many open files. We limit each pass to at most 256 operations in progress at once.

inParallel :: (a -> IO b) -> [a] -> IO [b]
inParallel f xs = do p <- pool 256; parMapIO (p . f) xs
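A quick sanity check in GHCi; the actions run concurrently, but parMapIO returns the results in input order:

λ> inParallel (\n -> return (n * n)) [1..5 :: Int]
[1,4,9,16,25]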

For efficiency, we find potential duplicates by size, and then checksum only these files. We use external shell commands for checksumming as well as the initial directory traversal. At the end we print the names of duplicated files, one per line, with a blank line after each group of duplicates.

main :: IO ()
main = do
    files <- endBy "\0" `fmap` [sh| find -type f -print0 |]

    let getSize f = ((f,) . fileSize) `fmap` getFileStatus f
    sizeDupes <- dupes `fmap` inParallel getSize files

    let getSha f = ((f,) . head . words) `fmap` [sh| sha1sum $f |]
    shaDupes <- dupes `fmap` inParallel getSha (concat sizeDupes)

    mapM_ (mapM_ putStrLn . (++[""])) shaDupes

And we run it like so:

$ ghc -O -threaded -rtsopts dupe.hs
$ ./dupe +RTS -N
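
With a few duplicates present, the output looks something like this (file names hypothetical):

./notes/todo.txt
./backup/todo.txt

./build/a.out
./old/build/a.out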

I included type signatures for clarity, but you wouldn't need them in a one-off script. Not counting imports and the LANGUAGE pragma, that makes 10 lines of code total. I'm pretty happy with the expressiveness of this solution, especially the use of parallel IO for an easy speedup.

Thanks to Dylan Lukes for the original idea for this library.

5 comments:

  1. Where are 'pool' and 'parMapIO' defined?

  2. Anacrolix: They're defined in the module Control.Concurrent.Spawn, from the package 'spawn'.

  3. 10 lines? Of much more complex code? Plus more stuff around it?

    That would have taken one line in bash. And be nicer to the eye / simpler to read.
    Something like this would probably do the trick:
    find -type f | while read F; do echo "$(sha1sum "$F" | cut -c 1-40) $(stat -c %s "$F") $F"; done | uniq -d -f 3 -c # -c is optional
    (Files with backslashes in their names won’t work.)

    Sure, it won’t be as fast to run, but it’s just a one-off thing, and the hard disks are the speed bottleneck anyway, so the difference will be negligible.

    I love me some Haskell, but *you’re doing it wrong*!

  4. BAReFOOt:

    You seem to be missing the point. The article is about the shqq library, not about finding duplicate files. That's what we in the business like to call "an example".

  5. Cool, the only change needed to run on OSX was to add a '.' argument after 'find'. Thanks for sharing!
