#haskell IRC channel on Freenode has archived logs going back to late 2001. That's a lot of data: over 10 million lines, almost 700 MB uncompressed. There's all kinds of clever analyses we could do, and I think I know what language we'd like to use.
So I wrote clogparse, a library for parsing these log files. Let's see who
lambdabot has attacked this year:
import System.Environment ( getArgs )
import qualified Data.Text as Text
import qualified Data.Text.IO as Text
main :: IO ()
main = getArgs >>= mapM_ (\c -> do
es <- parseLog haskellConfig c
let m = Nick $ Text.pack "lambdabot"
mapM_ Text.putStrLn [ t | EventAt _ (Act n t) <- es, n == m ])
$ ghc --make -O -Wall test.hs [1 of 1] Compiling Main ( test.hs, test.o ) Linking test ... $ ./test logs/10.* +RTS -A400M hits lunabot with an assortment of kitchen utensils will count to five... would never hurt Cale! pulls copumpkin through the Evil Mangler throws some pointy lambdas at medfly slaps Jafet puts on her slapping gloves, and slaps cads hits Twey with an assortment of kitchen utensils pats Gracenotes on the head pushes c_wraith for using fromJust from his chair ...
This is allocation-heavy code. I increased the allocation area size with
+RTS -A400M, which significantly improves performance.
Of course it's hard to beat
grep for something so simple. Let's next look at when people join the channel throughout the day:
import qualified Data.IntMap as M
-- Map times to 10-minute buckets.
bucket :: UTCTime -> Int
bucket (UTCTime _ dt) = floor (toRational dt / 600)
unbucket :: Int -> String
unbucket n = printf "%02d:%02d" d (m*10) where
(d,m) = divMod n 6
-- For use with 'alter'; increments an IntMap key strictly.
inc :: (Num a) => a -> Maybe a -> Maybe a
inc x Nothing = Just x
inc x (Just n) = Just $! (n+x)
-- Get a count of 'Join' events per time bucket for one file.
get :: FilePath -> IO (M.IntMap Int)
get p = do
es <- parseLog haskellConfig p
let bs = [ bucket t | EventAt t (Join _ _) <- es ]
return $! foldl' (flip . M.alter $ inc 1) M.empty bs
main :: IO ()
main = do
p <- pool 8 -- to avoid exhausting memory
ms <- getArgs >>= mapM (spawn . p . get) >>= sequence
-- IntMap lacks a strict 'unionWith', so fold over assocs.
let acc = foldl' (\h (k,v) -> M.alter (inc v) k h) M.empty
m = acc $ concatMap M.assocs ms
-- print (time, count) in gnuplot-friendly format
mapM_ (uncurry $ printf "%s %8d\n")
[ (unbucket k, v) | (k,v) <- M.assocs m ]
We count how many join events occur in each ten-minute interval of the day. We parse in a separate thread for each file, then consolidate the results. Spawning more than 3,000 threads is normally not a problem, but we don't want to load every log file into memory at once. We'd like to have the simplicity of one thread per file, but keep some of them blocked waiting for others to finish. That's exactly what the new
pool function in spawn 0.2 does.
Let's see the results:
$ ghc --make -O -Wall -threaded histogram.hs $ ./histogram logs/* +RTS -N2 > hist.dat $ gnuplot gnuplot> set xdata time gnuplot> set timefmt "%H:%M" gnuplot> set format x "%H:%M" gnuplot> plot 'hist.dat' using 1:2 with steps
And we get this graph:
(Modulo some extra
gnuplot tweaking on my part. Click for big.)
Times are labeled in UTC. You can sort of make sense of this data: there's a big peak around 09:10, at the beginning of the day in Europe, and some others later on, maybe from the USA. There's another big peak at 23:00; why? How much of this is due to particular institutions, netsplits, etc?
Anyway, that's a couple of small examples of what this library might be good for. I'm looking forward to seeing some clever applications in the future.