Friday, September 24, 2010

clogparse: parsing #haskell IRC logs

The #haskell IRC channel on Freenode has archived logs going back to late 2001. That's a lot of data: over 10 million lines, almost 700 MB uncompressed. There's all kinds of clever analyses we could do, and I think I know what language we'd like to use.

So I wrote clogparse, a library for parsing these log files. Let's see who lambdabot has attacked this year:

import Data.IRC.CLog.Parse
import System.Environment ( getArgs )
import qualified Data.Text as Text
import qualified Data.Text.IO as Text

main :: IO ()
main = getArgs >>= mapM_ (\c -> do
es <- parseLog haskellConfig c
let m = Nick $ Text.pack "lambdabot"
mapM_ Text.putStrLn [ t | EventAt _ (Act n t) <- es, n == m ])

Running it:

$ ghc --make -O -Wall test.hs
[1 of 1] Compiling Main             ( test.hs, test.o )
Linking test ...

$ ./test logs/10.* +RTS -A400M
hits lunabot with an assortment of kitchen utensils
will count to five...
would never hurt Cale!
pulls copumpkin through the Evil Mangler
throws some pointy lambdas at medfly
slaps Jafet
puts on her slapping gloves, and slaps cads
hits Twey with an assortment of kitchen utensils
pats Gracenotes on the head
pushes c_wraith for using fromJust from his chair
...

This is allocation-heavy code. I increased the allocation area size with +RTS -A400M, which significantly improves performance.

Of course it's hard to beat grep for something so simple. Let's next look at when people join the channel throughout the day:

import Data.IRC.CLog.Parse
import System.Environment
import Data.List
import Data.Time
import Text.Printf
import Control.Concurrent.Spawn
import qualified Data.IntMap as M

-- Map times to 10-minute buckets.
bucket :: UTCTime -> Int
bucket (UTCTime _ dt) = floor (toRational dt / 600)

unbucket :: Int -> String
unbucket n = printf "%02d:%02d" d (m*10) where
(d,m) = divMod n 6

-- For use with 'alter'; increments an IntMap key strictly.
inc :: (Num a) => a -> Maybe a -> Maybe a
inc x Nothing = Just x
inc x (Just n) = Just $! (n+x)

-- Get a count of 'Join' events per time bucket for one file.
get :: FilePath -> IO (M.IntMap Int)
get p = do
es <- parseLog haskellConfig p
let bs = [ bucket t | EventAt t (Join _ _) <- es ]
return $! foldl' (flip . M.alter $ inc 1) M.empty bs

main :: IO ()
main = do
p <- pool 8 -- to avoid exhausting memory
ms <- getArgs >>= mapM (spawn . p . get) >>= sequence
-- IntMap lacks a strict 'unionWith', so fold over assocs.
let acc = foldl' (\h (k,v) -> M.alter (inc v) k h) M.empty
m = acc $ concatMap M.assocs ms
-- print (time, count) in gnuplot-friendly format
mapM_ (uncurry $ printf "%s %8d\n")
[ (unbucket k, v) | (k,v) <- M.assocs m ]

We count how many join events occur in each ten-minute interval of the day. We parse in a separate thread for each file, then consolidate the results. Spawning more than 3,000 threads is normally not a problem, but we don't want to load every log file into memory at once. We'd like to have the simplicity of one thread per file, but keep some of them blocked waiting for others to finish. That's exactly what the new pool function in spawn 0.2 does.

Let's see the results:

$ ghc --make -O -Wall -threaded histogram.hs
$ ./histogram logs/* +RTS -N2 > hist.dat
$ gnuplot
gnuplot> set xdata time
gnuplot> set timefmt "%H:%M"
gnuplot> set format x "%H:%M"
gnuplot> plot 'hist.dat' using 1:2 with steps

And we get this graph:

(Modulo some extra gnuplot tweaking on my part. Click for big.)

Times are labeled in UTC. You can sort of make sense of this data: there's a big peak around 09:10, at the beginning of the day in Europe, and some others later on, maybe from the USA. There's another big peak at 23:00; why? How much of this is due to particular institutions, netsplits, etc?

Anyway, that's a couple of small examples of what this library might be good for. I'm looking forward to seeing some clever applications in the future.

11 comments:

  1. Hey neat, I like the histogram that shows when people login. Very cool to know.

    ReplyDelete
  2. Hi Keegan, nice work, I needed something like this for hpaste.org so that I can provide the 'context in IRC' feature that paste.lisp.org used to sport. I was actually going to just import the whole thing into a PostgreSQL database and then query the database. I think I can use your library to do the initial big import and updates. In the end I will extend hpaste.org slightly to have an IRC tab which will allow you to browse and query the IRC logs. Thanks, I really didn't feel like writing a log parser!

    A while ago I parsed those same 700mb logs into a hisotgram to make this! http://www.haskell.org/sitewiki/images/1/17/Haskell-wordle-irc.png

    I wonder what other fun stuff you could come up with. Maybe a pisg kind of thing?

    ReplyDelete
  3. Thanks for sharing.I found a lot of interesting information here. A really good post, very thankful and hopeful that you will write many more posts like this one.
    https://www.ucbrowser.vip/
    https://shareit.onl/
    https://mxplayer.pro/

    ReplyDelete


  4. Thankyou for the valuable information.iam very interested with this one.
    looking forward for more like this.

    ac Market
    ac market downloading
    ac market for android
    ac market ios
    ac market for pc

    ReplyDelete
    Replies
    1. Examine this blog post https://philosophyessay.org/category/all-essay-writing-tips/ for some good old and brand new essay writing tips. They could be useful

      Delete
  5. Packers and Movers Bangalore - Reliable and Verified Household Shifting Service Providers Give Reasonable ###Packers and Movers Charges. Cheap and Best Office Relocation Compare Quotation for Assurance for Local and Domestic House Shifting and Get estimates today to save upto 20%, ***Read Customer Reviews - @ Local Packers And Movers Bangalore

    ReplyDelete
  6. Rajasthan Board (RBSE) New Syllabus 2021 Pdf – 10th 12th 11th 9th Class: The Board of Secondary Education Rajasthan has changed the syllabus of 9th, 10th, 11th and 12th class for the session 2021-2022 due to COVID-19. The aspirants who are preparing for their annual board examination they will require the Rajasthan Board 10th New Syllabus. Raj 10th Books pdf It will help them to learn and study according to the updated topics of the exam. As we know the board examination plays an important part in the career of any student. The students of the 12th class can download the RBSE 12th New Syllabus 2021.

    ReplyDelete
  7. Jac Board model papers 2022 (mock exam papers) for class 12th students are available for free download in pdf format. These question papers are based on the most recent curriculum and are available on the Jac Board's official website. JAC 12th Model Paper 2022 Jac Board Model Papers 2022, Jac Board 12th Model Papers 2022, Jac Board class 12th Question Papers.

    ReplyDelete