Tuesday, October 26, 2010

Tour of a real toy Haskell program, part 2

This is part 2 of my commentary on detrospector. You can find part 1 here.

Performance claims

I make a lot of claims about performance below, not all of which are supported by evidence. I did perform a fair amount of profiling, but it's impossible to test every combination of data structures. How can I structure my code to make more of my performance hypotheses easily testable?

Representing the source document

When we process a source document, we'll need a data structure to represent large amounts of text.

  • Haskell's standard String type is simple, maximally lazy, and Unicode-correct. However, as a singly-linked list of individually boxed heap-allocated characters, it's woefully inefficient in space and time.

  • ByteString is a common alternative, but solves the wrong problem. We need a sequence of Unicode codepoints, not a sequence of bytes.

  • ByteString.Char8 is a hack and not remotely Unicode-correct.

  • The text library stores Unicode characters in a ByteString-like packed format. That's exactly what we need.

For more information on picking a string type, see ezyang's writeup.

We choose the lazy flavor of text so that we can stream the input file in chunks without storing it all in memory. Lazy IO is semantically dubious, but I've decided that this project is enough of a toy that I don't care. Note that robustness to IO errors is not in the requirements list. ;)

The enumerator package provides iteratees as a composable, well-founded alternative to the lazy IO hack. The enumerator API was complex enough to put me off using it for the first version of detrospector. After reading Michael Snoyman's enumerators tutorial, I have a slight idea how the library works, and I might use it in a future version.

Representing substrings

We also need to represent the k-character substrings, both when we analyze the source and when we generate text. The requirements here are different.

We expect that k will be small, certainly under 10 — otherwise, you'll just generate the source document! With many small strings, the penalty of boxing each individual Char is less severe.

And we need to support queue-like operations; we build a new substring by pushing a character onto the end, and dropping one from the beginning. For both String and Text, appending onto the end requires a full copy. So we'll use Data.Sequence, which provides sequences with bidirectional append, implemented as finger trees:

-- module Detrospector.Types
type Queue a = S.Seq a

...Actually, I just ran a quick test with Text as the queue type, and it seems not to matter much. Since k is small, the O(k) copy is insignificant. Profiling makes fools of us all, and asymptotic analysis is mostly misused. Anyway, I'm keeping the code the way it is, pending any insight from my clever readers. Perhaps one of the queues from Purely Functional Data Structures would be most appropriate.

Representing character counts

We need a data structure for tabulating character counts. In the old days of C and Eurocentrism, we could use a flat, mutable int[256] indexed by ASCII values. But the range of Unicode characters is far too large for a flat array, and we need efficient functional updates without a full copy.

We could import Data.Map and use Map Char Int. This will build a balanced binary search tree using pairwise Ord comparisons.

But we can do better. We can use the bits of an integer key as a path in a tree, following the left or right child for a 0 or 1 bit, respectively. This sort of search tree (a trie) will typically outperform repeated pairwise comparisons. Data.IntMap implements this idea, with an API very close to Map Int. Our keys are Chars, but we can easily convert using fromEnum.

-- module Detrospector.Types
import qualified Data.IntMap as IM
...
type FreqTable = IM.IntMap Int
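As a quick illustration, tallying one character into such a table looks like this (countChar is a hypothetical helper for this post, not part of detrospector):

```haskell
import qualified Data.IntMap as IM

type FreqTable = IM.IntMap Int

-- tally one occurrence of a character, keyed by its codepoint
countChar :: Char -> FreqTable -> FreqTable
countChar c = IM.insertWith (+) (fromEnum c) 1
```

IM.insertWith merges the new count 1 with any existing count via (+), so repeated calls accumulate.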

Representing distributions

So we have some frequency table like

IM.fromList [(fromEnum 'e', 267), (fromEnum 't', 253), (fromEnum 'a', 219), ...]

How can we efficiently pick a character from this distribution? We're mapping characters to individual counts, but we really want a map from cumulative counts to characters:

-- module Detrospector.Types
data PickTable = PickTable Int (IM.IntMap Char)

To sample a character from PickTable t im, we first pick a random k such that 0 ≤ k < t, using a uniform distribution. We then find the first key in im which is greater than k, and take its associated Char value. In code:

-- module Detrospector.Types
import qualified System.Random.MWC as RNG
...
sample :: PickTable -> RNG.GenIO -> IO Char
sample (PickTable t im) g = do
  k <- (`mod` t) <$> RNG.uniform g
  case IM.split k im of
    (_, IM.toList -> ((_,x):_)) -> return x
    _ -> error "impossible"

The largest cumulative sum is the total count t, so the largest key in im is t. We know k < t, so IM.split k im will never return an empty map on the right side.

Note the view pattern for pattern-matching an IntMap as if it were an ascending-order list.
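To make the split step concrete, here's a hypothetical pick helper (not from detrospector) that isolates just the lookup, without the randomness:

```haskell
import qualified Data.IntMap as IM

-- after IM.split k, only keys strictly greater than k remain in the
-- right map; the first such entry holds the sampled character
pick :: Int -> IM.IntMap Char -> Char
pick k im = case IM.toList (snd (IM.split k im)) of
  ((_, x) : _) -> x
  []           -> error "impossible: k is below the total count"
```

With im = fromList [(3,'b'),(5,'a')], any k from 0 to 2 picks 'b' and any k from 3 to 4 picks 'a', matching the counts 3 and 2.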

The standard System.Random library in GHC Haskell is quite slow, a problem shared by most language implementations. We use the much faster mwc-random package. The only operation we need is picking a uniform Int as an IO action.

We still need a way to build our PickTable:

-- module Detrospector.Types
import Data.List ( mapAccumR )
...
cumulate :: FreqTable -> PickTable
cumulate t = PickTable r $ IM.fromList ps where
  (r,ps) = mapAccumR f 0 $ IM.assocs t
  f ra (x,n) = let rb = ra+n in (rb, (rb, toEnum x))

This code is short, but kind of tricky. For reference:

mapAccumR :: (acc -> x -> (acc, y)) -> acc -> [x] -> (acc, [y])
f :: Int -> (Int, Int) -> (Int, (Int, Char))

f takes an assoc pair from the FreqTable, adds its count to the running sum, and produces an assoc pair for the PickTable. We start the traversal with a sum of 0, and get the final sum r along with our assoc pairs ps.
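To see the right-to-left traversal concretely, here's a self-contained replica of the cumulate logic run on a two-character table:

```haskell
import Data.List (mapAccumR)
import qualified Data.IntMap as IM

-- replica of the cumulate traversal on a two-entry table:
-- mapAccumR runs right to left, threading the running sum
demo :: (Int, [(Int, Char)])
demo = mapAccumR f 0 (IM.assocs t)
  where
    t = IM.fromList [(fromEnum 'a', 2), (fromEnum 'b', 3)]
    f ra (x, n) = let rb = ra + n in (rb, (rb, toEnum x))
```

demo evaluates to (5, [(5,'a'),(3,'b')]): 'b' owns counts 0 through 2, 'a' owns 3 through 4, and the total is 5.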

Representing the Markov chain

So we can represent probability distributions for characters. Now we need a map from k-character substrings to distributions.

Data.Map is again an option, but pairwise, character-wise comparison of our Queue Char values will be slow. What we really want is another trie, with character-based fanout at each node. Hackage has bytestring-trie, which unfortunately works on bytes, not characters. And maybe I should have used TrieMap or list-tries. Instead I used the hashmap package:

-- module Detrospector.Types
import qualified Data.HashMap as H
...
data Chain = Chain Int (H.HashMap (Queue Char) PickTable)

A value Chain k hm maps subsequences of up to k Chars to PickTables. Looking up a Queue Char key takes one traversal to compute its Int hash, followed by an IntMap lookup to find a (hopefully small) Map of keys sharing that hash.

There is a wrinkle: we need to specify how to hash a Queue, which is just a synonym for S.Seq. This is an orphan instance, which we could avoid by newtype-wrapping S.Seq.

-- module Detrospector.Types
import qualified Data.Hashable as H
import qualified Data.Foldable as F
...
instance (H.Hashable a) => H.Hashable (S.Seq a) where
  {-# SPECIALIZE instance H.Hashable (S.Seq Char) #-}
  hash = F.foldl' (\acc h -> acc `H.combine` H.hash h) 0

This code is lifted almost verbatim from the [a] instance in Data.Hashable.

Serialization

After training, we need to write the Chain to disk, for use in subsequent generation. I started out with derived Show and Read, which was simple but incredibly slow. We'll use binary with ByteString.Lazy — the dreaded lazy IO again!

We start by specifying how to serialize a few types. Here the tuple instances for Binary come in handy:

-- module Detrospector.Types
import qualified Data.Binary as Bin
...
-- another orphan instance
instance (Bin.Binary k, Bin.Binary v, H.Hashable k, Ord k)
      => Bin.Binary (H.HashMap k v) where
  put = Bin.put . H.assocs
  get = H.fromList <$> Bin.get

instance Bin.Binary PickTable where
  put (PickTable n t) = Bin.put (n,t)
  get = uncurry PickTable <$> Bin.get

instance Bin.Binary Chain where
  put (Chain n h) = Bin.put (n,h)
  get = uncurry Chain <$> Bin.get
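As a quick sanity check of the tuple and IntMap instances, here's a round trip through encode and decode on a pair shaped like PickTable's payload (roundTrip is a hypothetical helper, using plain types so we get Eq for free):

```haskell
import qualified Data.Binary as Bin
import qualified Data.IntMap as IM

-- encode/decode a pair shaped like PickTable's (total, table) payload;
-- binary already knows how to serialize tuples and IntMaps
roundTrip :: (Int, IM.IntMap Char) -> (Int, IM.IntMap Char)
roundTrip = Bin.decode . Bin.encode
```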

The actual IO is easy. We use gzip compression, which fits right into the IO pipeline:

-- module Detrospector.Types
import qualified Data.ByteString.Lazy as BSL
import qualified Codec.Compression.GZip as Z
...
withChain :: FilePath -> (Chain -> RNG -> IO a) -> IO a
withChain p f = do
  ch <- (Bin.decode . Z.decompress) <$> BSL.readFile p
  RNG.withSystemRandom $ f ch

writeChain :: FilePath -> Chain -> IO ()
writeChain out = BSL.writeFile out . Z.compress . Bin.encode

Training the chain

I won't present the whole implementation of the train subcommand, but here's a simplification:

-- module Detrospector.Modes.Train
import qualified Data.Text as Txt
import qualified Data.Text.IO as Txt
import qualified Data.HashMap as H
import qualified Data.IntMap as IM
import qualified Data.Sequence as S
import qualified Data.Foldable as F
...
train Train{num,out} = do
  (_,h) <- Txt.foldl' roll (emptyQ, H.empty) <$> Txt.getContents
  writeChain out . Chain num $ H.map cumulate h where

  roll (!s,!h) x
    = (shift num x s, F.foldr (H.alter $ ins x) h $ S.tails s)

  ins x Nothing  = Just $! sing x
  ins x (Just v) = Just $! incr x v

  sing x = IM.singleton (fromEnum x) 1

  incr x = IM.alter f $ fromEnum x where
    f Nothing  = Just 1
    f (Just v) = Just $! (v+1)

Before generating PickTables, we build a HashMap of FreqTables. We fold over the input text, accumulating a pair of (last characters seen, map so far). Since foldl' is only strict to weak head-normal form (WHNF), we use bang patterns on the fold function roll to force further evaluation. RWH discusses the same issue.

shift (from Detrospector.Types) pushes a new character into the queue, and drops the oldest character if the size exceeds num. We add one count for the new character x, both for the whole history s and each of its suffixes.
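The real shift lives in Detrospector.Types; here's a sketch consistent with that description, not necessarily the actual definition:

```haskell
import qualified Data.Sequence as S

type Queue a = S.Seq a

-- push x on the right; once the queue would exceed n elements,
-- drop the oldest element from the left
shift :: Int -> a -> Queue a -> Queue a
shift n x s
  | S.length s < n = s S.|> x
  | otherwise      = case S.viewl (s S.|> x) of
      _ S.:< rest -> rest
      S.EmptyL    -> S.empty
```

Both the push and the drop are cheap on a finger tree, which is the reason for choosing Data.Sequence here in the first place.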

We're using alter where perhaps a combination of lookup and insert would be more natural. This is a workaround for a subtle laziness-related space leak, which I found after much profiling and random mucking about. When you insert into a map like so:

let mm = insert k (v+1) m

there is nothing to force v+1 to WHNF, even if you force mm to WHNF. The leaves of our tree end up holding large thunks of the form ((((1+1)+1)+1)+...).

The workaround is to call alter with Just $! (v+1). We know that the implementation of alter will pattern-match on the Maybe constructor, which then triggers WHNF evaluation of v+1 because of ($!). This was tricky to figure out. Is there a better solution, or some different way I should approach this problem? It seems to me that Data.Map and friends are generally lacking in strictness building blocks.
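Here's the pattern in isolation (bump is a hypothetical helper, not from detrospector):

```haskell
import qualified Data.IntMap as IM

-- strict-alter pattern: ($!) forces the incremented count to WHNF
-- before it is stored, so no ((1+1)+1)+... thunk chain builds up
bump :: Int -> IM.IntMap Int -> IM.IntMap Int
bump = IM.alter f
  where
    f Nothing  = Just 1
    f (Just v) = Just $! (v + 1)
```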

The end!

Thanks for reading! Here's an example of how not to write the same program:

module Main where{import Random;(0:y)%(R p _)=y%p;(1:f)%(R _ n)=f%n;[]%(J x)=x;b
[p,v,k]=(l k%).(l v%).(l p%);main=getContents>>=("eek"#).flip(.:"eek")(y.y.y.y$0
);h k=toEnum k;;(r:j).:[k,v,b]=(j.:[v,b,r]).((k?).(v?).(b?)$(r?)m);[].:_=id;(!)=
iterate;data R a=J a|R(R a)(R a);(&)i=fmap i;k b y v j=let{t=l b%v+y;d f|t>j=b;d
f=k(m b)t v j}in d q;y l=(!!8)(q R!J l);q(+)b=b+b;p(0:v)z(R f x)=R(p v z f)x;p[]
z(J x)=J(z x);p(1:v)z(R n k)=R n$p v z k;m = succ;l p=tail.(snd&).take 9$((<<2).
fst)!(fromEnum p,0);(?)=p.l;d@[q,p,r]#v=let{u=b d v;j=i u;(s,o)|j<1=((97,122),id
)|h 1=((0,j-1),(k 0 0 u))}in do{q<-(h.o)&randomRIO s;putChar q;[p,r,q]#v};i(J q)
=q;i(R n k)=i n+i k;(<<)=divMod} -- finite text on stdin © keegan oct 2010 BSD3
