Shell scripts make it easy to pass data between external commands. But shell script as a programming language lacks features like non-trivial data structures and easy, robust concurrency. These would be useful in building quick solutions to system administration and automation problems.
As others have noted,[1][2][3][4][5] Haskell is an interesting alternative for these scripting tasks. I wrote the shqq library to make it a little easier to invoke external programs from Haskell. With the sh quasiquoter, you write a shell command which embeds Haskell variables, execute it as an IO action, and get the command's standard output as a String. In other words, it's a bit like the backtick operator from Perl or Ruby.
Here's a small example:
$ ghci -XQuasiQuotes
λ> import System.ShQQ
λ> let x = "/proc/uptime" in [sh| sha1sum $x |]
"337ec3fb998fb3a4650a18e0785f0992762b3cda /proc/uptime\n"
shqq also handles escaping for you, so that the shell will not interpret special characters from Haskell variables. You can override this behavior when desired.
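To make the idea of escaping concrete, here is a minimal sketch of one common way to quote a string for a POSIX shell. The name shellQuote is hypothetical, and this is not necessarily how shqq does it internally; it only illustrates why an interpolated value such as a file name containing spaces or quotes reaches the command as a single argument.

```haskell
-- Hypothetical helper, not part of shqq: wrap a string in single quotes
-- so that a POSIX shell treats it as one literal word.
shellQuote :: String -> String
shellQuote s = "'" ++ concatMap esc s ++ "'"
  where
    -- A single quote cannot appear inside single quotes, so close the
    -- quoted section, emit an escaped quote, and reopen it.
    esc '\'' = "'\\''"
    esc c    = [c]
```

With this sketch, shellQuote "my file.txt" yields the text 'my file.txt' including the surrounding single quotes, so the shell sees one argument instead of two.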
The big caveat is that shqq refuses to build on GHC 7.0 or earlier, due to a Unicode-handling bug in process-1.0. You'll need GHC 7.2 or later.
Finding duplicate files
As an example, here's a program to find duplicate files in the directory tree rooted at the current working directory.
{-# LANGUAGE QuasiQuotes, TupleSections #-}
import Control.Concurrent.Spawn -- package spawn-0.3
import Data.List.Split -- package split-0.1
import System.ShQQ -- package shqq-0.1
import System.Posix.Files
import qualified Data.Map as M
First, the computation itself. If we pair each file with a key, such as size or checksum, we can find the groups of (potentially) duplicate files.
dupes :: (Ord k) => [(FilePath,k)] -> [[FilePath]]
dupes = filter (not . null . drop 1) . M.elems
      . foldr (\(v,k) -> M.insertWith (++) k [v]) M.empty
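A small worked example may help show what dupes returns. The definition is the one from the article; the sample list and the names sample and sampleGroups are made up for illustration, with invented sizes standing in for the key.

```haskell
import qualified Data.Map as M

-- Same definition as in the article.
dupes :: (Ord k) => [(FilePath, k)] -> [[FilePath]]
dupes = filter (not . null . drop 1) . M.elems
      . foldr (\(v, k) -> M.insertWith (++) k [v]) M.empty

-- Hypothetical sample data: file names paired with made-up sizes.
sample :: [(FilePath, Integer)]
sample =
  [ ("a.txt", 10), ("b.txt", 20), ("c.txt", 10)
  , ("d.txt", 30), ("e.txt", 20) ]

-- Groups that share a key; keys with a single file are dropped.
sampleGroups :: [[FilePath]]
sampleGroups = dupes sample
-- [["a.txt","c.txt"],["b.txt","e.txt"]]
```

Groups come out in ascending key order because M.elems walks the map by key, and d.txt disappears because no other file shares its size.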
We want to examine files in parallel, but the operating system will complain if we have too many open files. We limit each pass to have at most 256 tests in progress at once.
inParallel :: (a -> IO b) -> [a] -> IO [b]
inParallel f xs = do p <- pool 256; parMapIO (p . f) xs
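For readers without the spawn package, the same idea can be sketched with only base. This is not how spawn implements pool or parMapIO; it is a stand-in under the assumption that a counting semaphore is an acceptable way to cap the number of actions in progress. The names boundedParMap and runOne are hypothetical.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.QSem (QSem, newQSem, signalQSem, waitQSem)
import Control.Exception (SomeException, bracket_, throwIO, try)

-- Hypothetical helper: run one action in its own thread, holding a
-- semaphore slot while it runs, and deliver the outcome through an MVar.
runOne :: QSem -> IO b -> IO (MVar (Either SomeException b))
runOne sem act = do
  v <- newEmptyMVar
  _ <- forkIO (try (bracket_ (waitQSem sem) (signalQSem sem) act) >>= putMVar v)
  return v

-- Hypothetical stand-in for inParallel: at most n actions run at once.
-- Results are returned in input order; the first failure is rethrown.
boundedParMap :: Int -> (a -> IO b) -> [a] -> IO [b]
boundedParMap n f xs = do
  sem  <- newQSem n
  vars <- mapM (runOne sem . f) xs
  mapM (\v -> takeMVar v >>= either throwIO return) vars
```

Under these assumptions, boundedParMap 256 plays the role of inParallel. Every item still gets its own lightweight thread, but a thread only starts its real work once it holds one of the n slots.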
For efficiency, we find potential duplicates by size, and then checksum only these files. We use external shell commands for checksumming as well as the initial directory traversal. At the end we print the names of duplicated files, one per line, with a blank line after each group of duplicates.
main :: IO ()
main = do
    files <- endBy "\0" `fmap` [sh| find -type f -print0 |]
    let getSize f = ((f,) . fileSize) `fmap` getFileStatus f
    sizeDupes <- dupes `fmap` inParallel getSize files
    let getSha f = ((f,) . head . words) `fmap` [sh| sha1sum $f |]
    shaDupes <- dupes `fmap` inParallel getSha (concat sizeDupes)
    mapM_ (mapM_ putStrLn . (++[""])) shaDupes
And we run it like so:
$ ghc -O -threaded -rtsopts dupe.hs
$ ./dupe +RTS -N
I included type signatures for clarity, but you wouldn't need them in a one-off script. Not counting imports and the LANGUAGE pragma, that makes 10 lines of code total. I'm pretty happy with the expressiveness of this solution, especially the use of parallel IO for an easy speedup.
Thanks to Dylan Lukes for the original idea for this library.
Where are 'pool' and 'parMapIO' defined?
Anacrolix: They're defined in the module Control.Concurrent.Spawn, from the package 'spawn'.
10 lines? Of much more complex code? Plus more stuff around it?
That would have taken one line in bash. And be nicer to the eye / simpler to read.
Something like this would probably do the trick:
find -type f | while read F; do echo "$(sha1sum "$F" | cut -c 1-40) $(stat -c %s "$F") $F"; done | uniq -d -f 3 -c # -c is optional
(Files with backslashes in their names won’t work.)
Sure, it won’t be as fast to run, but it’s just a one-off thing, and the hard disks are the speed bottleneck anyway, so the difference will be negligible.
I love me some Haskell, but *you’re doing it wrong*!
BAReFOOt:
You seem to be missing the point. The article is about the shqq library, not about finding duplicate files. That's what we in the business like to call "an example".
Cool, the only change needed to run on OSX was to add a '.' argument after 'find'. Thanks for sharing!