Tablify is a tool to render CSV into tables of various formats, including HTML, tbl, and character art (both ASCII and Unicode). What motivated me to write this was discovering the way sexy cool UTF8 box characters. They appear to render really well in Apple’s new Menlo font—at least substantially better than they do in Microsoft’s Consolas font which was my previous fave. They seem to have a lot in common.
It also does HTML and tbl(1) output.
For kicks, I decided to write it in Haskell. I was actually contemplating what language I am the most fluent in, and I found myself running a lot of sloccount and wondering how exactly to compare my output in various languages. Two ways that seemed promising were to compare by number of files I've created and by number of lines of code. Here’s what I found:
|Language||# of Files|
|Language||# of Lines|
I think there are some interesting natural divisions in here. For example, PHP, Python and Ruby are at the top in both categories. This reflects the fact that most of my professional programming experience has been in these languages. Towards the bottom, there’s my old friends C, Erlang and Clojure. Clojure undoubtedly is there because it’s the language I'm the newest with; the others because I don’t really like them that much (no offense, Erlang, I think you’re great).
Funny thing about Prolog is, I think I start a lot of programs in Prolog but I don’t finish very many of them. Lisp has the opposite phenomena; it looks like my average Lisp file is fairly large, or else I have some fairly large Lisp files. I think I've only ever really written one small-to-medium sized app in Lisp and it certainly took more code than I expected it to.
I think what’s interesting about this table is the stuff in the middle: OCaml and Haskell. I haven’t really written an OCaml program in some time but I still think of it as comparable in power and utility to C++, whereas Haskell is more on par in terms of complexity (thankfully neither is syntactically). So I started looking through my little files and I found a surprising spread of functionality amongst my Haskell programs. I had a few network servers, several parallel calculations, and some other stuff. Another interesting thing is that I didn’t really see any Haskell programs sitting in a broken incomplete state. There were several incomplete programs but few broken ones. I even found a program called “ritual.hs” which I had apparently started but abandoned for doing nothing but exactly this kind of linguistic archeology for me.
As I was making my little chart I happened on the Unicode box drawing characters and the idea was born. So I decided to try doing it in Haskell and see if my attitude towards the world’s most useless programming language has changed.
As usual, Haskell made some of the program trivial and other parts rather difficult. I didn’t find the I/O to be particularly hard. Most of the hair pulling came from changing interfaces and learning more about the module system, compounded by some issues having to do with the update to Mac OS X 10.6.
Initially, I thought I would want my internal data structure to be an array, so I used Data.Array. I've since changed it to a list of lists, which would have been the obvious choice, and removed quite a bit of code in the process without (to my knowledge) affecting the runtime characteristics much.
There was one rather nasty stack-blowing bug. I found a fair amount of advice online for fixing it but on reflection all I had to do was replace
foldl' in one place. This took an hour or two.
There are many places I feel the code is obtuse or poorly factored. Reading Real World Haskell gives me the impression that I'm still an early grade of Haskell programmer. The book does a good job of showing the evolution of code from horrifying to mediocre to grand. Mine’s on the horrifying side in places, on the grand side in others, but not very far from the middle anywhere. I notice in addition to isolating the monadic code, I tend to try and limit the amount of it. There were several places, particularly in the HTML generation facility, where I thought I could eschew some complexity with the appropriate monad if only I knew which one and how to go about it.
There’s also a built-in HTML combinator library I could be using instead of the homebrew one I made, as well as a third-party library that provides a
ByteString-based CSV parser. I tried to port over to it but ran into issues with
String literals, then found the
Data.ByteString.Char8 module and the third-party
Data.ByteString.UTF8 module, but none of these seem to be using the same
ByteString. For example, if I import
Data.ByteString.UTF8 and use its
fromString function, I can’t seem to pass it to the output function
Data.ByteString.putStrLn. This is currently one of the other big issues I have.
So essentially, while I have understood monads for some time at least conceptually, I seem to lack both the library knowledge of which ones to use when, and the experiential knowledge of how to apply them and perhaps create my own. Obnoxiously, I see a fair number of little examples of monads in the programs I wrote to learn Haskell but I don’t see how to apply them inside an actual program effectively.
I guess you could say my relationship with Haskell continues to be characterized by love of the ideals and hate of the hardships in getting to proficiency. I suspect a big part of the problem is trying to learn Haskell on one’s own. Yet, even if I write poor Haskell, I can be assured that if it compiles, it’s likely to be correct.
Also next on my list of things to learn for Haskell is QuickCheck. My eyes glazed over when I read about coarbitrary: “The coarbitrary method interprets a value of type a as a generator transformer. It should be defined so that different values are interpreted as independent generator transformers.” Sure. I’ll get right on that.
Highlights of my code:
-- from FixedWidth.hs -- given a table, return a list of the maximum widths of each column columnWidths :: Table -> [Integer] columnWidths table = map (maximum . map length) $ transpose table -- from HTML.hs -- escapes the bare minimum of characters for safe embedding in HTML htmlEscape :: String -> String htmlEscape html = foldl' process html mappings where mappings = [("&", "&"), ("<", "<"), (">", ">")] process text (find, replace) = subRegex (mkRegex find) text replace
Pretty much everything else is a mess. I welcome any advice.
This is my analysis of Republicans, religion and the triumph of unreason by Johann Hari.
First of all, in broad strokes, I agree with the description of the symptoms in the essay. The American right is basically living in another universe. When Bush was in power, the liberal conspiracy theories were plausible: that Bush wanted to invade Iran, for example. The new conspiracy theories are just insane. I am reasonably certain that Obama’s first health care reform bill will be a major disaster, but only because they are rushing to get it through and with a bunch of democrats in power it’s like to be padded with tons of pork. The fears that are being passed around are completely absurd. Isn’t it enough to be afraid of the likeliest case, that we’ll just be saddled with more useless government? I hope Obama can do better, but statistically his odds are not great.
However, I think it’s silly to blame religion for this problem. With atheism at around 15% of the population, Obama is not in power because of some kind of nationwide repudiation of religion. This is the chief vanity of the left, that it represents a harbinger of the kind of “rationality” espoused by certain intellectuals who happen to be atheists.
It’s tempting to put the blame at religion’s doorstep because it was supposedly the rallying cry of the right. But the truth about the American right is that it is a massive hodgepodge of vastly different groups with almost nothing in common. And guess what? So is the left. Politicians use propaganda to acquire power, not because they actually believe it. How can you accuse Bush and his cronies of corruption and then simultaneously blame their ideals? Obviously if their ideals meant anything they wouldn’t have been corrupt in the first place. Instead we hate them for what they've done and then we hate them more for being hypocrites. This is a game two can (and will) play.
It’s also tempting to blame religion because it is a handy target and has few defenders. Especially when couched in gerrymandered terminology like “organized religion.” What makes this article’s kind of analysis so disgusting to me is the assumption that the essence of religion is faith. In American Christianity, the foundation of the religion is faith, but this is like saying the problem with Congress is that they meet in a building with a floor. The problem with the American right is that they’re credulous, but guess what? That’s the problem with the American left as well! The common factor I see is the American education system. The 2.2% that are homeschooled would be just as statistically irrelevant to the counter-argument as the atheists above. If what you hate is Christianity, why not just come right out and say it and why?
For every Bible-thumping redneck we love to hate, there’s a suburban driving a gas guzzling SUV with a bike rack on top who lives 20 miles from where they work. Evidence from psychological studies that abortions are highly detrimental to the mother’s mental health are discarded from the abortion debate because that debate is about “rights,” but they then turn around and spin arguments about the environment’s health being grounds for restricting rights. Which one is it? You can argue both rationally. It would be better economically to legalize weed and prostitution and regulate them. Which rational argument do you want to make?
This infatuation with rationality is handy in a debate, but what do you do if you have contradictory facts? Upon what do you base a rational debate about ethics? Assertions? Religions supply a foundation for these kinds of debates that cannot otherwise be deduced. There is no fundamental particle that carries good or evil. This means science is of limited use in many of the most important debates. You'd think the author of this essay encourages people to marry on the basis of empirical evidence from observation and calculations! Apparently mechanical hearts can also bleed!
The best part is, this author doesn’t have a single statistic to back up his assertions about religion. It’s all speculation. Angels dancing on the head of a pin, if you will. But because it is so well argued, it will further reinforce in the mind of the intellectual left that they’re fighting a war against religion rather than a war against propaganda and irrationality. In other words, this article is itself leftist propaganda that distracts from the real issues with made-up facts! Isn’t that the very thing this article is against? This self-certainty about ones interpretation of reality is groupthink exactly and it is what causes the continual pendulum sway of American politics.
Groupthink is certainly preventing the health care debate from taking shape the right way. I think intentional meddling might be a bigger problem. Responding to redder groupthink with bluer groupthink isn’t a solution. Mr. Hari would do better to analyze both sides of the debate with the same kind of cynicism.
PostgreSQL 8.4 is the first version to introduce OLAP functionality in the form of the windowing functions. These functions are new to me too but they look quite interesting and powerful. They will greatly simplify combining many large SQL statements into simpler ones for complex reporting, as well as make it possible to push some “fixup” processing that depends on ordering back into the database.
As an example, I'm going to be using the
orders table from the (Dell store example database)[http://pgfoundry.org/projects/dbsamples/]. A few rows from this table look like this:
SELECT orderid, orderdate, customerid, netamount::text::money, tax::text::money, totalamount::text::money FROM orders LIMIT 10;
Without getting into the new features, it’s possible to generate interesting summary information. You might want to see how much you made each month, for example:
SELECT EXTRACT(MONTH FROM orderdate) AS month, EXTRACT(YEAR FROM orderdate) AS year, SUM(totalamount)::text::money AS total FROM orders GROUP BY year, month ORDER BY year, month;
What we’re doing here is just grouping on some computed data, in this case, EXTRACT is giving us the information we need from the DATE to use it for grouping. An interesting fact about grouping is that it restricts you to relational calculations. You cannot, for example, do something different depending on where you are in the calculation. Grouping also applies to the whole statement, making it hard to make a complex report without using many subselects.
The window functions were created to support OLAP or Off-Line Analytical Processing, as opposed to OLTP or On-Line Transaction Processing which is what SQL RDBMSes are generally thought of. Despite this nobody has managed to give me an example of what OLAP is like in reality, so it must be a pretty specific and rare topic. However, the nutshell of windowing is that it lets you combine multiple different groupings in the same statement, or do calculations with respect to your current location in the generated result relation.
What if you want to also calculate the year-to-date earnings for each month? Without the window functions I'm not sure this is possible in straight SQL. What if you want to rank each month relative to the other months in terms of income?
Windowing essentially is a modification of an aggregate function with a new clause,
OVER. There are some new functions too, like
rank() which is essentially the row number. The
OVER clause itself is built out of two optional expressions, an
ORDER BY expression which is the same as the kind you already know, and a
PARTITION BY which works much like
GROUP BY with some extra features.
The rank of a given month is clearly related to the total amount made in that month, so its
OVER clause will be
ORDER BY total DESC since the largest amount should come first. Let’s add that to the query:
SELECT EXTRACT(MONTH FROM orderdate) AS month, EXTRACT(YEAR FROM orderdate) AS year, SUM(totalamount)::text::money AS total, rank() OVER (ORDER BY total DESC) FROM orders GROUP BY year, month ORDER BY year, month;
This is already giving us some information we ordinarily would have had to infer or calculate on the other side. Now let’s say you want to also see the earnings year-to-date for each month. Our first try isn’t going to work:
SELECT EXTRACT(MONTH FROM orderdate) AS month, EXTRACT(YEAR FROM orderdate) AS year, SUM(totalamount)::text::money AS total, rank() OVER (ORDER BY total DESC), SUM(total)::text::money OVER (ORDER BY year, month) AS ytd FROM orders GROUP BY year, month ORDER BY year, month;
ERROR: column "total" does not exist LINE 6: SUM(total)::text::money OVER (ORDER BY year, month) AS ytd ^
What this message is showing us is that our SUM(totalamount) AS total isn't visible inside the OVER clause of the other column. Fortunately, this is easy to work around by making the original query a subselect of the new query like so:
SELECT *, SUM(total) OVER (ORDER BY year, month)::text::money AS ytd FROM (SELECT EXTRACT(MONTH FROM orderdate) AS month, EXTRACT(YEAR FROM orderdate) AS year, SUM(totalamount)::text::money AS total, rank() OVER (ORDER BY total DESC) FROM orders GROUP BY year, month ORDER BY year, month) AS q ORDER BY year, month;
As an aside, you really shouldn’t use the
money type in PostgreSQL. I only used it here to make the formatting a little prettier. In general you should use
NUMERIC(X,2) where X is a number larger than you expect to ever have in your database. If you have to deal with multiple countries (or might someday) you should also track the currency code and perhaps even the conversion rate to your native currency code on that particular day. It gets complex in a hurry.
I should have brought this up in the previous post. The question on everyone’s mind must be, if SQL has recursion, how do you compute the Fibonacci sequence? Here’s how:
SELECT fib FROM (WITH RECURSIVE fibonacci(fib, fib1) AS ( SELECT 1, 0 UNION SELECT 1 :: NUMERIC, 1 :: NUMERIC UNION SELECT fib + fib1, fib FROM fibonacci ) SELECT * FROM fibonacci) AS fibs LIMIT 1000;
I couldn’t see a way to refer to the previous two rows, so instead I just carry along the previous row’s value inside the current row, so the next calculation can see the previous two values. I wrapped the whole thing in a
SELECT to hide
NUMERIC type gives you arbitrary-precision math (or at least to 1000 decimals). Without it, the maximum number you can calculate is 46 because 1134903170 + 1836311903 overflows the normal integer.
Apparently I missed the news: PostgreSQL 8.4 came out on July 1st! This release brings two new features and greatly improves one old feature. This post will be about recursive queries using
A classic problem in SQL is how to model trees. In web development you often have to store nested subcategories or model pages as files inside folders. The simplest proper way to model this relationship in SQL is to add a self-referencing column to the table to point to a row’s parent row. For example, you might represent the file path
/foo/bar/baz.html with the following table:
This structure has a number of pleasant features from a relational perspective. For one thing, you can rename a folder by modifying just one row. You can insert a new node anywhere in this hierarchy by just knowing the parent folder’s ID. You can also move a node anywhere in the hierarchy by modifying that one ID.
Unfortunately, from an implementation perspective, this structure sucks for doing many common operations: computing a string version of a path, for example, or listing all the descendants under a given node. There are many simpler, less normalized methods that are better at these common operations at the expense of some relational normality. (One I have used is to have outside software maintain a text path in the database.)
With recursive queries it’s now possible to recreate the string path, find all descendant nodes, and get all of the parent nodes up to the root. I'm going to show you how to do each of these things with just one query apiece.
All of the magic comes down to the
WITH statement. This statement is provided both to support recursive queries and also for common table expressions or CTEs. If you find yourself repeating the same table definition in your query you can pull it out and put it above the statement in a
WITH clause. You can think of this as a way to make a temporary view just for a single SQL statement.
The basic structure you’re looking at is this:
WITH temp_view(col1, ...) AS (SELECT ...) SELECT ...;
If you use
temp_view inside the
SELECT within the
WITH statement, you have to add
WITH RECURSIVE temp_view(col1, ...) AS (SELECT ...) SELECT ...;
All three of the queries I'm going to show you rely on the recursive flavor in order to do their work. There is a design pattern in the inner
SELECT whenever a recursive query is written, which is related to the notion of recursion. You must define a base case and an inductive case. The base case sets up the initial conditions for the recursion, usually the first set of rows for the result. The inductive case uses the rows which have already been computed to compute more rows and decide whether or not the recursion is complete.
For example, to find the breadcrumbs for a given path, the base case is the initial path. The inductive case is just to find the parent path. If there’s no parent path, the recursion is complete. Inside the SQL, you’ll do a naïve
UNION it with a
SELECT that uses the recursively defined name. You can read more about this in the (PostgreSQL Documentation)[http://www.postgresql.org/docs/8.4/interactive/queries-with.html]. So the SQL will look like this:
WITH RECURSIVE breadcrumb(id, name, parent_id) AS ( SELECT id, name, parent_id FROM fs WHERE id = 4 UNION SELECT fs.id, fs.name, fs.parent_id FROM fs, breadcrumb WHERE breadcrumb.parent_id = fs.id) SELECT * FROM breadcrumb;
The base case gets all of the information for the row we’re interested in, which is the row with ID 4. The inductive case takes the previously generate row and gets the row from
fs with the ID of the parent ID of the row with ID 4, which in this case is the row with ID 3. The next iteration finds the row from
fs with the ID of the parent ID of the row with ID 3, and so on.
If you’re following along at home, run this SQL to set up our initial conditions:
CREATE TABLE fs ( id SERIAL PRIMARY KEY, name VARCHAR, parent_id INTEGER REFERENCES fs(id) ); INSERT INTO fs (name, parent_id) VALUES ('/', NULL), ('foo', 1), ('bar', 2), ('baz.html', 3);
Now you should get this output from the above SQL:
Let’s work on the other two problems: getting a string path and listing descendants of a given node. We’ll use the root node for the sake of fun and add a few extra nodes so that the tree is fuller and tests our query a little better:
INSERT INTO fs (name, parent_id) VALUES ('under foo', 2), ('under bar', 3), ('under bar 2', 3), ('baz', 2), ('bazzle.html', 8);
Now, to do this query the base case would seem to be the root of the tree. In thinking about what kind of result we want, I think for convenience we might want the current node’s name and the full path of this node in addition to the regular columns. We’re going to assume that the root of the tree is the node with ID 1, which gives us this base case:
WITH RECURSIVE path(name, path, parent, id, parent_id) AS ( SELECT name, '/', NULL, id, parent_id FROM fs WHERE id = 1
Now the inductive case just takes that case and computes the descendant nodes of that node by looking at their parent_id:
SELECT fs.name, parentpath.path || CASE parentpath.path WHEN '/' THEN '' ELSE '/' END || fs.name, parentpath.path, fs.id, fs.parent_id FROM fs, path as parentpath WHERE fs.parent_id = parentpath.id
Calculating the path, the
CASE statement you see there is a hack to prevent extra ‘/’s at the start of the path because of the UNIX convention of paths starting at /. Otherwise you see I'm just taking the name of this node and concatenating the parent path with my name, separated by /’s. So this is the full version:
WITH RECURSIVE path(name, path, parent, id, parent_id) AS ( SELECT name, '/', NULL, id, parent_id FROM fs WHERE id = 1 UNION SELECT fs.name, parentpath.path || CASE parentpath.path WHEN '/' THEN '' ELSE '/' END || fs.name, parentpath.path, fs.id, fs.parent_id FROM fs, path as parentpath WHERE fs.parent_id = parentpath.id) SELECT * FROM path;
Let’s see what it does!
|under foo||/foo/under foo||/foo||5||2|
|under bar||/foo/bar/under bar||/foo/bar||6||3|
|under bar 2||/foo/bar/under bar 2||/foo/bar||7||3|
Looks like it worked! But now that we have this statement it seems like it would be handy to make it a function so you can find the descendants of any particular node. The path might not look right but at least you could get relative children. Let’s try it:
CREATE FUNCTION children_of(root_id INTEGER) RETURNS TABLE (name VARCHAR, path VARCHAR, parent VARCHAR, id INTEGER, parent_id INTEGER) AS $$ WITH RECURSIVE path(name, path, parent, id, parent_id) AS ( SELECT name, '/', NULL, id, parent_id FROM fs WHERE id = $1 UNION SELECT fs.name, parentpath.path || CASE parentpath.path WHEN '/' THEN '' ELSE '/' END || fs.name, parentpath.path, fs.id, fs.parent_id FROM fs, path as parentpath WHERE fs.parent_id = parentpath.id) SELECT * FROM path; $$ LANGUAGE 'sql';
All I've done here is wrap the previous statement with some stuff to make it into a function and replaced 1 with $1, which makes the value the first parameter of the function call. For extra convenience, if I were using this as the foundation of a CMS or something where I would expect to need the path frequently, I might also make a view:
CREATE VIEW paths AS SELECT * FROM children_of(1);
It’s worth noting that recursive queries probably do not have excellent performance characteristics (I'm not certain but it seems likely.) If you do run into performance problems with this, you might consider looking into (materialized views)[http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views] which can be implemented manually in PostgreSQL using triggers.
If you prefer, you could instead create a function to find the path of a given ID:
CREATE FUNCTION pathname(id INTEGER) RETURNS VARCHAR AS $$ SELECT '/' || path FROM (WITH RECURSIVE pathto(path, id) AS ( SELECT name, parent_id FROM fs WHERE id = $1 UNION SELECT fs.name || '/' || pathto.path, parent_id FROM fs, pathto WHERE fs.id = pathto.id) SELECT * FROM pathto) AS pathto(path, id) WHERE id = 1; $$ LANGUAGE 'sql';
I was then tempted to make an index on this function, but PostgreSQL won’t allow computed indices using functions not marked
IMMUTABLE, which is a way of certifying to PostgreSQL that it is a pure function. Because this function will return different values depending on the content of a table, it isn’t immutable, and the index could get out-of-sync with reality without PostgreSQL knowing it. Intuitively, if you rename a node, all of its children will then have different paths reflecting the new name but it’s unclear how you could make PostgreSQL aware of that fact.
Plan 9 is Unix sans castration.
The central idea behind Plan 9 is that a filesystem is a dandy interface to all kinds of things. Take a look at the list of system calls in Plan 9, all 51 of them.
What kinds of things can be filesystems that typically aren’t? All your processes—post a note (remember signals in Unix?) by writing to their control file, see what files they have open, etc. Your network interface. Need NAT? Mount the gateway’s network devices locally. Get a connection and listen or connect to a remote host by opening a file and writing to it. Access a database through the filesystem. Access the web through the filesystem. Access FTP or any other remote filesystem through the local filesystem. Manipulate this window on the screen through the filesystem. Draw graphics through the filesystem. Access your keychain through the filesystem. Access your sound card through the filesystem.
Exposing everything through the filesystem has three principle benefits:
- Now any aspect of your machine’s hardware, display, or any running software can be accessed from other places on the network. Or vice versa.
- Want to improve security by jailing a process? Remove the interfacing files from the namespace of that process.
- Access to your abstraction is democratized and language-agnostic.
The first two points get made a lot in Plan 9 circles and are the backbone of a secure distributed environment. But the third point is one that’s been fascinating me. The echo service on Plan 9 is a shell script that runs
cat. You can use regular shell scripts to connect programs and devices across machines. If you find a filesystem that provides access to a given device, protocol, or whatever, every language on your machine gets uniform access to that device, protocol or whatever without any bias towards a particular one.
This also means that in Plan 9, writing a filesystem is better in many cases than writing a library. A library can only be used from C; a filesystem can be used from C, the shell,
awk or any other language. Furthermore, a filesystem can be exported over the network or imported from another host. Hooking up a file to another file doesn’t sound very interesting, but if one file is the output of a progress bar application and the other is your soundcard’s control file, it becomes interesting. Or if one file is your mail and the other is your Jabber account, and between them is a little AWK script that pulls out the subject. Or if one file is your hard drive’s SMART status and the other is your SMS gateway. The mere thought of being able to combine these things with trivial shell scripts in the grand Unix tradition is quite alluring.