Tablify CSV Utility

Posted by Daniel Lyons on September 11, 2009

Tablify is a tool to render CSV into tables of various formats, including HTML, tbl, and character art (both ASCII and Unicode). What motivated me to write this was discovering the way sexy cool Unicode box-drawing characters. They appear to render really well in Apple’s new Menlo font, at least substantially better than they do in Microsoft’s Consolas font, which was my previous fave. The two fonts seem to have a lot in common.

Obligatory screenshot:

[Screenshot: tablify making pretty Unicode and ASCII art]

It also does HTML and tbl(1) output.

For kicks, I decided to write it in Haskell. I was actually contemplating what language I am the most fluent in, and I found myself running a lot of sloccount and wondering how exactly to compare my output in various languages. Two ways that seemed promising were to compare by number of files I've created and by number of lines of code. Here’s what I found:

Language    # of Files
PHP         196
Python      158
Ruby        146
Haskell     144
OCaml       88
Prolog      75
Lisp        63
C           44
Erlang      41
Clojure     30

Language    # of Lines
PHP         12,749
Python      10,476
Ruby        7,051
OCaml       5,817
Lisp        5,058
Haskell     3,256
C           1,412
Prolog      1,349
Clojure     1,177
Erlang      964

I think there are some interesting natural divisions in here. For example, PHP, Python and Ruby are at the top in both categories. This reflects the fact that most of my professional programming experience has been in these languages. Towards the bottom, there are my old friends C, Erlang and Clojure. Clojure undoubtedly is there because it’s the language I'm newest to; the others because I don’t really like them that much (no offense, Erlang, I think you’re great).

Funny thing about Prolog is, I think I start a lot of programs in Prolog but I don’t finish very many of them. Lisp has the opposite phenomenon; it looks like my average Lisp file is fairly large, or else I have some fairly large Lisp files. I think I've only ever really written one small-to-medium sized app in Lisp and it certainly took more code than I expected it to.

I think what’s interesting about this table is the stuff in the middle: OCaml and Haskell. I haven’t really written an OCaml program in some time but I still think of it as comparable in power and utility to C++, whereas Haskell is more on par with C++ in terms of complexity (thankfully, neither resembles it syntactically). So I started looking through my little files and I found a surprising spread of functionality amongst my Haskell programs. I had a few network servers, several parallel calculations, and some other stuff. Another interesting thing is that I didn’t really see any Haskell programs sitting in a broken state. There were several incomplete programs but few broken ones. I even found a program called “ritual.hs”, which I had apparently started and then abandoned, whose sole purpose was to do exactly this kind of linguistic archeology for me.

As I was making my little chart I happened on the Unicode box drawing characters and the idea was born. So I decided to try doing it in Haskell and see if my attitude towards the world’s most useless programming language had changed.

As usual, Haskell made some of the program trivial and other parts rather difficult. I didn’t find the I/O to be particularly hard. Most of the hair pulling came from changing interfaces and learning more about the module system, compounded by some issues having to do with the update to Mac OS X 10.6.

Initially, I thought I would want my internal data structure to be an array, so I used Data.Array. I've since changed it to a list of lists, which would have been the obvious choice, and removed quite a bit of code in the process without (to my knowledge) affecting the runtime characteristics much.
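
To give a feel for why a list of lists is convenient here, this is a minimal sketch of rendering a Unicode box from that representation. It is not the actual tablify source: the `Table` synonym is assumed to be `[[String]]`, the helper names `padTo`, `rule` and `renderBox` are made up for illustration, and it assumes every row has the same number of cells.

```haskell
import Data.List (intercalate, transpose)

-- Assumption: a table is a list of rows, each row a list of cell strings.
type Table = [[String]]

-- Widest cell in each column; transpose turns rows into columns.
columnWidths :: Table -> [Int]
columnWidths = map (maximum . map length) . transpose

-- Pad a cell on the right with spaces up to the given width.
padTo :: Int -> String -> String
padTo width cell = cell ++ replicate (width - length cell) ' '

-- Draw a horizontal rule from a left corner, a junction and a right corner.
rule :: String -> String -> String -> [Int] -> String
rule left mid right widths =
  left ++ intercalate mid [replicate w '─' | w <- widths] ++ right

-- Render the whole table as lines of Unicode box-drawing art.
renderBox :: Table -> [String]
renderBox table =
  [rule "┌" "┬" "┐" widths] ++ map row table ++ [rule "└" "┴" "┘" widths]
  where
    widths = columnWidths table
    row cells = "│" ++ intercalate "│" (zipWith padTo widths cells) ++ "│"
```

Calling `mapM_ putStrLn (renderBox table)` prints the box. Because `transpose` gives the columns directly, there is no index arithmetic of the kind an array would need.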

There was one rather nasty stack-blowing bug. I found a fair amount of advice online for fixing it but on reflection all I had to do was replace foldr with foldl' in one place. This took an hour or two.
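
As a rough illustration of that kind of fix (this is a hypothetical example, not the code that actually blew the stack in tablify), here is the same computation written both ways. With a combining function like `max` that needs both of its arguments, `foldr` has to recurse all the way to the end of the list before it can reduce anything, while `foldl'` forces the accumulator at every step.

```haskell
import Data.List (foldl')

-- Right fold: max needs the result for the rest of the list first, so the
-- pending calls pile up to a depth equal to the length of the list.
longestCellLazy :: [String] -> Int
longestCellLazy = foldr (\cell best -> max (length cell) best) 0

-- Strict left fold: the running maximum is evaluated at each step, so the
-- fold runs in constant stack space no matter how long the list is.
longestCell :: [String] -> Int
longestCell = foldl' (\best cell -> max best (length cell)) 0
```

Both functions return the same answer; they differ only in how much stack they need on long inputs. Whether the lazy version actually overflows depends on the list length and the runtime's stack limit.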

There are many places I feel the code is obtuse or poorly factored. Reading Real World Haskell gives me the impression that I'm still an early grade of Haskell programmer. The book does a good job of showing the evolution of code from horrifying to mediocre to grand. Mine’s on the horrifying side in places, on the grand side in others, but not very far from the middle anywhere. I notice in addition to isolating the monadic code, I tend to try and limit the amount of it. There were several places, particularly in the HTML generation facility, where I thought I could eschew some complexity with the appropriate monad if only I knew which one and how to go about it.

There’s also a built-in HTML combinator library I could be using instead of the homebrew one I made, as well as a third-party library that provides a ByteString-based CSV parser. I tried to port over to it but ran into issues with ByteStrings and String literals, then found the Data.ByteString.Char8 module and the third-party Data.ByteString.UTF8 module, but none of these seem to be using the same ByteString. For example, if I import Data.ByteString.UTF8 and use its fromString function, I can’t seem to pass it to the output function Data.ByteString.putStrLn. This is currently one of the other big issues I have.

So essentially, while I have understood monads for some time at least conceptually, I seem to lack both the library knowledge of which ones to use when, and the experiential knowledge of how to apply them and perhaps create my own. Obnoxiously, I see a fair number of little examples of monads in the programs I wrote to learn Haskell but I don’t see how to apply them inside an actual program effectively.

I guess you could say my relationship with Haskell continues to be characterized by love of the ideals and hate of the hardships in getting to proficiency. I suspect a big part of the problem is trying to learn Haskell on one’s own. Yet, even if I write poor Haskell, I can be assured that if it compiles, it’s likely to be correct.

Also next on my list of things to learn for Haskell is QuickCheck. My eyes glazed over when I read about coarbitrary: “The coarbitrary method interprets a value of type a as a generator transformer. It should be defined so that different values are interpreted as independent generator transformers.” Sure. I’ll get right on that.

Highlights of my code:

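        -- from FixedWidth.hs
        import Data.List (genericLength, transpose)

        -- given a table, return a list of the maximum widths of each column
        -- (genericLength rather than length, because length returns Int)
        columnWidths :: Table -> [Integer]
        columnWidths table = map (maximum . map genericLength) $ transpose table

        -- from HTML.hs
        import Data.List (foldl')
        import Text.Regex (mkRegex, subRegex)

        -- escapes the bare minimum of characters for safe embedding in HTML
        htmlEscape :: String -> String
        htmlEscape html = foldl' process html mappings
            where
                -- "&" goes first so the entities added afterwards are not re-escaped
                mappings = [("&", "&amp;"), ("<", "&lt;"), (">", "&gt;")]
                process text (find, replace) = subRegex (mkRegex find) text replace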
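
One possible simplification of htmlEscape, offered as a sketch rather than as what tablify does: since each mapping replaces a single character, the same escaping can be done in one pass with `concatMap` and no regex library. The name `htmlEscapeChars` is hypothetical.

```haskell
-- Escape the same three characters as htmlEscape in a single pass.
-- Each input character is examined once, so "&" is never re-escaped.
htmlEscapeChars :: String -> String
htmlEscapeChars = concatMap escapeChar
  where
    escapeChar '&' = "&amp;"
    escapeChar '<' = "&lt;"
    escapeChar '>' = "&gt;"
    escapeChar c   = [c]
```

This version also makes the order of the mappings irrelevant, which the fold over regex substitutions has to get right.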

Pretty much everything else is a mess. I welcome any advice.