PostgreSQL 8.4: Windowing Functions
PostgreSQL 8.4 is the first version to introduce OLAP functionality in the form of the windowing functions. These functions are new to me too but they look quite interesting and powerful. They will greatly simplify combining many large SQL statements into simpler ones for complex reporting, as well as make it possible to push some "fixup" processing that depends on ordering back into the database.
As an example, I'm going to be using the orders table from the (Dell store example database)[http://pgfoundry.org/projects/dbsamples/]. A few rows from this table look like this:
SELECT
orderid, orderdate, customerid,
netamount::text::money,
tax::text::money,
totalamount::text::money
FROM orders LIMIT 10;
| orderid | orderdate | customerid | netamount | tax | totalamount |
|---|---|---|---|---|---|
| 1 | 2004-01-27 | 7888 | $313.24 | $25.84 | $339.08 |
| 2 | 2004-01-01 | 4858 | $54.90 | $4.53 | $59.43 |
| 3 | 2004-01-17 | 15399 | $160.10 | $13.21 | $173.31 |
| 4 | 2004-01-28 | 17019 | $106.67 | $8.80 | $115.47 |
| 5 | 2004-01-09 | 14771 | $256.00 | $21.12 | $277.12 |
| 6 | 2004-01-11 | 13734 | $382.59 | $31.56 | $414.15 |
| 7 | 2004-01-05 | 17622 | $256.44 | $21.16 | $277.60 |
| 8 | 2004-01-18 | 8331 | $67.85 | $5.60 | $73.45 |
| 9 | 2004-01-06 | 14902 | $29.82 | $2.46 | $32.28 |
| 10 | 2004-01-18 | 15112 | $20.78 | $1.71 | $22.49 |
Without getting into the new features, it's possible to generate interesting summary information. You might want to see how much you made each month, for example:
SELECT
EXTRACT(MONTH FROM orderdate) AS month,
EXTRACT(YEAR FROM orderdate) AS year,
SUM(totalamount)::text::money AS total
FROM orders
GROUP BY year, month
ORDER BY year, month;
| month | year | total |
|---|---|---|
| 1 | 2004 | $215,898.76 |
| 2 | 2004 | $216,792.09 |
| 3 | 2004 | $210,564.40 |
| 4 | 2004 | $216,642.01 |
| 5 | 2004 | $213,395.19 |
| 6 | 2004 | $216,412.59 |
| 7 | 2004 | $206,707.38 |
| 8 | 2004 | $210,115.92 |
| 9 | 2004 | $213,034.71 |
| 10 | 2004 | $211,952.17 |
| 11 | 2004 | $216,373.17 |
| 12 | 2004 | $219,498.40 |
What we're doing here is just grouping on some computed data, in this case, EXTRACT is giving us the information we need from the DATE to use it for grouping. An interesting fact about grouping is that it restricts you to relational calculations. You cannot, for example, do something different depending on where you are in the calculation. Grouping also applies to the whole statement, making it hard to make a complex report without using many subselects.
The window functions were created to support OLAP or Off-Line Analytical Processing, as opposed to OLTP or On-Line Transaction Processing which is what SQL RDBMSes are generally thought of. Despite this nobody has managed to give me an example of what OLAP is like in reality, so it must be a pretty specific and rare topic. However, the nutshell of windowing is that it lets you combine multiple different groupings in the same statement, or do calculations with respect to your current location in the generated result relation.
What if you want to also calculate the year-to-date earnings for each month? Without the window functions I'm not sure this is possible in straight SQL. What if you want to rank each month relative to the other months in terms of income?
Windowing essentially is a modification of an aggregate function with a new clause, OVER. There are some new functions too, like rank() which is essentially the row number. The OVER clause itself is built out of two optional expressions, an ORDER BY expression which is the same as the kind you already know, and a PARTITION BY which works much like GROUP BY with some extra features.
The rank of a given month is clearly related to the total amount made in that month, so its OVER clause will be ORDER BY total DESC since the largest amount should come first. Let's add that to the query:
SELECT
EXTRACT(MONTH FROM orderdate) AS month,
EXTRACT(YEAR FROM orderdate) AS year,
SUM(totalamount)::text::money AS total,
rank() OVER (ORDER BY total DESC)
FROM orders
GROUP BY year, month
ORDER BY year, month;
| month | year | total | rank |
|---|---|---|---|
| 1 | 2004 | $215,898.76 | 6 |
| 2 | 2004 | $216,792.09 | 2 |
| 3 | 2004 | $210,564.40 | 10 |
| 4 | 2004 | $216,642.01 | 3 |
| 5 | 2004 | $213,395.19 | 7 |
| 6 | 2004 | $216,412.59 | 4 |
| 7 | 2004 | $206,707.38 | 12 |
| 8 | 2004 | $210,115.92 | 11 |
| 9 | 2004 | $213,034.71 | 8 |
| 10 | 2004 | $211,952.17 | 9 |
| 11 | 2004 | $216,373.17 | 5 |
| 12 | 2004 | $219,498.40 | 1 |
This is already giving us some information we ordinarily would have had to infer or calculate on the other side. Now let's say you want to also see the earnings year-to-date for each month. Our first try isn't going to work:
SELECT
EXTRACT(MONTH FROM orderdate) AS month,
EXTRACT(YEAR FROM orderdate) AS year,
SUM(totalamount)::text::money AS total,
rank() OVER (ORDER BY total DESC),
SUM(total)::text::money OVER (ORDER BY year, month) AS ytd
FROM orders
GROUP BY year, month
ORDER BY year, month;
ERROR: column "total" does not exist
LINE 6: SUM(total)::text::money OVER (ORDER BY year, month) AS ytd
^
What this message is showing us is that our SUM(totalamount) AS total isn't visible inside the OVER clause of the other column. Fortunately, this is easy to work around by making the original query a subselect of the new query like so:
SELECT
*,
SUM(total) OVER (ORDER BY year, month)::text::money AS ytd
FROM
(SELECT
EXTRACT(MONTH FROM orderdate) AS month,
EXTRACT(YEAR FROM orderdate) AS year,
SUM(totalamount)::text::money AS total,
rank() OVER (ORDER BY total DESC)
FROM orders
GROUP BY year, month
ORDER BY year, month) AS q
ORDER BY year, month;
| month | year | total | rank | ytd |
|---|---|---|---|---|
| 1 | 2004 | $215,898.76 | 6 | $215,898.76 |
| 2 | 2004 | $216,792.09 | 2 | $432,690.85 |
| 3 | 2004 | $210,564.40 | 10 | $643,255.25 |
| 4 | 2004 | $216,642.01 | 3 | $859,897.26 |
| 5 | 2004 | $213,395.19 | 7 | $1,073,292.45 |
| 6 | 2004 | $216,412.59 | 4 | $1,289,705.04 |
| 7 | 2004 | $206,707.38 | 12 | $1,496,412.42 |
| 8 | 2004 | $210,115.92 | 11 | $1,706,528.34 |
| 9 | 2004 | $213,034.71 | 8 | $1,919,563.05 |
| 10 | 2004 | $211,952.17 | 9 | $2,131,515.22 |
| 11 | 2004 | $216,373.17 | 5 | $2,347,888.39 |
| 12 | 2004 | $219,498.40 | 1 | $2,567,386.79 |
As an aside, you really shouldn't use the money type in PostgreSQL. I only used it here to make the formatting a little prettier. In general you should use NUMERIC(X,2) where X is a number larger than you expect to ever have in your database. If you have to deal with multiple countries (or might someday) you should also track the currency code and perhaps even the conversion rate to your native currency code on that particular day. It gets complex in a hurry.