algebraicthunk.net/ blog/ entry/ How much speed are you willing to give up for pretty code?

After experimenting a bit with Haskell and using it for some toy projects, I decided to see if there was any chance it could be used for a non-toy task; i.e., one where it might one day be necessary to actually have fast code. As a proxy for this, I decided to try a very simple task: counting the number of lines in a file. As input, I generated a file with a million lines of random data and an average of 40 bytes per line; all the Haskell programs were compiled with the current GHC in Debian unstable and with optimizations enabled.

The baseline is, of course, our friend "wc":

real    0m0.084s
user    0m0.043s
sys     0m0.040s

(this and all other timing data was taken after "cat /tmp/testwc.txt > /dev/null" to ensure that the file was cached) Now, it's probably too much to hope that a Haskell program will run as fast as a C program does, since the C program can run in a tight loop. But it seems reasonable that it should be possible for Haskell to be competitive with the trivial Python implementation of this program -- after all, Python is interpreted and Haskell is compiled.

import sys
n = 0
for line in sys.stdin:
   n += 1
print n

Running this simple Python program under "time" produces the following statistics, a bit less than 10 times slower than the C version:

real    0m0.698s
user    0m0.643s
sys     0m0.050s

The trivial Haskell program to count the lines of its input is:

main = do c <- getContents ; print $ length $ lines c

Note that this is basically identical to the Python program, but expressed in a functional style. Pretty, but how does it stack up?

real    0m6.744s
user    0m6.426s
sys     0m0.213s

OW! Ten times slower than the Python code, and a good 80 times faster than "wc"! The gap can be lowered to "only" 50 times with the following program --

do c <- getContents; print $ length $ elemIndices '\n' c

-- but surely it's possible to do better. I'd expect that the expense here is in the use of lazy streams for IO; eliminating them gives us the following code:

wc :: Handle -> IO Int
wc h = do a <- mallocArray bufSize :: IO (Ptr Word8)
          loop a 0
    where loop a n = do amt  <- hGetBuf h a bufSize
                        elts <- peekArray amt  a
                        (if amt == 0
                         then return n
                         else loop a $! n + nlCount elts)

At this point, we're basically using Haskell as a (bad) imperative language. This function allocates a buffer, repeatedly reads into it, and counts the number of newlines in the buffer (I dropped some auxillary code to keep the post short). The $! is important; without it, even though this routine is tail-recursive, Haskell will still allocate an unbounded amount of memory to store the return value. (because, of course, it's terribly important to remember that you're actually returning "1+2+3", not "6")

So, how does this unreadable, hand-optimized version of the code perform?

real    0m1.980s
user    0m1.757s
sys     0m0.094s

That's right -- I've eliminated most uses of higher-level functional techniques, and the Haskell version is still two times slower than an ultra-high-level interpreted language, while also being uglier and harder to read! Ouch. It appears the old performance albatross of functional languages is not quite dead yet.

Comment by Joe Wreschnig at 10:01 PM:

How does Python do for better code?

import sys print len(sys.stdin)

Comment by pvaneynd at 10:30 PM:

My understanding of Haskell is almost zero, but from what I gather the tree programs do different things: the C and the Python program read a file line by line, the Haskell programs (all three of them as far as I can tell) read the whole file in one go and then count the number of newlines in them.

This is a radically different thing from reading a file line by line, not only are the memory requirements different, also the kernel has less information on how to prefetch the new few lines, nor can it see with ease that the old lines can be swapped out.

I'm willing to bet that writing a Haskell program that reads line by line will be significantly faster.

Comment by pvaneynd at 10:50 PM:

Following up, in Common Lisp I get:

Base wc result:

real 0m0.018s user 0m0.004s sys 0m0.012s

Reading the file in one swoop, counting newlines:

Evaluation took: 2.89 seconds of real time 2.360147 seconds of user run time 0.152009 seconds of system run time 43 page faults and 122,947,360 bytes consed.

Reading line by line:

Evaluation took: 0.35 seconds of real time 0.260016 seconds of user run time 0.072004 seconds of system run time 0 page faults and 35,717,576 bytes consed.

The last function can be improved upon by using a static buffer for the lines, thus removing a lot of the consing. Oh, and by using a newer version of the Lisp that has faster unicode processing.

Comment by Joe Wreschnig at 10:52 PM:

Sorry, that should be

import sys print len(list(sys.stdin))

Or possibly, import sys print len(sys.stdin.readlines())

I'm not actually sure which will be faster for that size of file.

Comment by Anonymous at 1:25 AM:

Out of interest, how well does this do:

perl -le '$i++ while <>; print $i'

Comment by ecb at 2:20 AM:

The real problem is in the representation of strings. String is really a linked list of Char; try the foilowing, which uses the FastPackedString library:

module Main where

import IO import qualified Data.FastPackedString as FPS

main :: IO () main = do fs <- FPS.hGetContents stdin putStrLn $ show $ length (FPS.lines fs)

This isn't really the kind of problem Haskell was designed for, of course, but it's nice to know that there are escapes if you really need speed.

Comment by Anonymous at 2:31 AM:

This has been discussed on the haskell-cafe mailling-list: http://thread.gmane.org/gmane.comp.lang.haskell.cafe/3907

Comment by Antti-Juhani Kaijanaho at 4:20 AM:

I did some tests myself. My best Haskell program, though unidiomatic, was essentially as fast as wc -l.

I just tried out the FastPackedString version suggested ecb. Its performance sucked worse than that of the trivial list-based one.

Comment by ecb at 6:09 AM:

I'm surprised... I tried the naive lazy list version and the FPS version, along with wc. The FPS version gives (for a one million line file):

real 0m1.257s user 0m0.810s sys 0m0.418s

wc gives:

real 0m0.844s user 0m0.451s sys 0m0.275s

I didn't run the lazy list version for long enough to get results...

Comment by dburrows at 6:23 PM:

pvaneynd, the "getContents" call generates a lazy stream which ought to just be read as it is consumed, modulo buffering; so no, it doesn't read the whole file at once. I wasn't too surprised that this performed poorly, although I was surprised at just how dreadful it was (I would have naively guessed more like 10 times the C version).

The third version is an explicit, imperative, tail-recursive loop like you'd do in Lisp. An interesting note about it is that performance actually decreases as the buffer size increases, to the point that it's worse than the purely functional version! I'm still trying to work that one out.

Comment by reading speed at 12:33 AM:

This post has been removed by a blog administrator.

Comment by dburrows at 8:13 PM:

I apologize for shutting comments on this post down, but it seems to have become a bit of a spam magnet. Email me if you have comments and I'll attach them here.