PostgreSQL 8.4: Windowing Functions

Posted by Daniel Lyons on August 12, 2009

PostgreSQL 8.4 is the first version to introduce OLAP functionality in the form of the windowing functions. These functions are new to me too but they look quite interesting and powerful. They will greatly simplify combining many large SQL statements into simpler ones for complex reporting, as well as make it possible to push some "fixup" processing that depends on ordering back into the database.

As an example, I'm going to be using the orders table from the (Dell store example database)[http://pgfoundry.org/projects/dbsamples/]. A few rows from this table look like this:

        SELECT 
          orderid, orderdate, customerid, 
          netamount::text::money, 
          tax::text::money, 
          totalamount::text::money 
        FROM orders LIMIT 10;
        
orderid orderdate customerid netamount tax totalamount
1 2004-01-27 7888 $313.24 $25.84 $339.08
2 2004-01-01 4858 $54.90 $4.53 $59.43
3 2004-01-17 15399 $160.10 $13.21 $173.31
4 2004-01-28 17019 $106.67 $8.80 $115.47
5 2004-01-09 14771 $256.00 $21.12 $277.12
6 2004-01-11 13734 $382.59 $31.56 $414.15
7 2004-01-05 17622 $256.44 $21.16 $277.60
8 2004-01-18 8331 $67.85 $5.60 $73.45
9 2004-01-06 14902 $29.82 $2.46 $32.28
10 2004-01-18 15112 $20.78 $1.71 $22.49

Without getting into the new features, it's possible to generate interesting summary information. You might want to see how much you made each month, for example:

        SELECT
          EXTRACT(MONTH FROM orderdate) AS month, 
          EXTRACT(YEAR FROM orderdate) AS year, 
          SUM(totalamount)::text::money AS total 
        FROM orders 
        GROUP BY year, month 
        ORDER BY year, month;
        
month year total
1 2004 $215,898.76
2 2004 $216,792.09
3 2004 $210,564.40
4 2004 $216,642.01
5 2004 $213,395.19
6 2004 $216,412.59
7 2004 $206,707.38
8 2004 $210,115.92
9 2004 $213,034.71
10 2004 $211,952.17
11 2004 $216,373.17
12 2004 $219,498.40

What we're doing here is just grouping on some computed data, in this case, EXTRACT is giving us the information we need from the DATE to use it for grouping. An interesting fact about grouping is that it restricts you to relational calculations. You cannot, for example, do something different depending on where you are in the calculation. Grouping also applies to the whole statement, making it hard to make a complex report without using many subselects.

The window functions were created to support OLAP or Off-Line Analytical Processing, as opposed to OLTP or On-Line Transaction Processing which is what SQL RDBMSes are generally thought of. Despite this nobody has managed to give me an example of what OLAP is like in reality, so it must be a pretty specific and rare topic. However, the nutshell of windowing is that it lets you combine multiple different groupings in the same statement, or do calculations with respect to your current location in the generated result relation.

What if you want to also calculate the year-to-date earnings for each month? Without the window functions I'm not sure this is possible in straight SQL. What if you want to rank each month relative to the other months in terms of income?

Windowing essentially is a modification of an aggregate function with a new clause, OVER. There are some new functions too, like rank() which is essentially the row number. The OVER clause itself is built out of two optional expressions, an ORDER BY expression which is the same as the kind you already know, and a PARTITION BY which works much like GROUP BY with some extra features.

The rank of a given month is clearly related to the total amount made in that month, so its OVER clause will be ORDER BY total DESC since the largest amount should come first. Let's add that to the query:

        SELECT 
          EXTRACT(MONTH FROM orderdate) AS month,
          EXTRACT(YEAR FROM orderdate) AS year, 
          SUM(totalamount)::text::money AS total,
          rank() OVER (ORDER BY total DESC)
        FROM orders
        GROUP BY year, month 
        ORDER BY year, month;
        
month year total rank
1 2004 $215,898.76 6
2 2004 $216,792.09 2
3 2004 $210,564.40 10
4 2004 $216,642.01 3
5 2004 $213,395.19 7
6 2004 $216,412.59 4
7 2004 $206,707.38 12
8 2004 $210,115.92 11
9 2004 $213,034.71 8
10 2004 $211,952.17 9
11 2004 $216,373.17 5
12 2004 $219,498.40 1

This is already giving us some information we ordinarily would have had to infer or calculate on the other side. Now let's say you want to also see the earnings year-to-date for each month. Our first try isn't going to work:

        SELECT 
          EXTRACT(MONTH FROM orderdate) AS month,
          EXTRACT(YEAR FROM orderdate) AS year, 
          SUM(totalamount)::text::money AS total,
          rank() OVER (ORDER BY total DESC),
          SUM(total)::text::money OVER (ORDER BY year, month) AS ytd
        FROM orders
        GROUP BY year, month 
        ORDER BY year, month;
        

        ERROR:  column "total" does not exist
        LINE 6:   SUM(total)::text::money OVER (ORDER BY year, month) AS ytd
                      ^
        

What this message is showing us is that our SUM(totalamount) AS total isn't visible inside the OVER clause of the other column. Fortunately, this is easy to work around by making the original query a subselect of the new query like so:

        SELECT
          *,
          SUM(total) OVER (ORDER BY year, month)::text::money AS ytd
        FROM
          (SELECT 
             EXTRACT(MONTH FROM orderdate) AS month,
             EXTRACT(YEAR FROM orderdate) AS year, 
             SUM(totalamount)::text::money AS total,
             rank() OVER (ORDER BY total DESC)
           FROM orders
           GROUP BY year, month 
           ORDER BY year, month) AS q
        ORDER BY year, month;
        

month year total rank ytd
1 2004 $215,898.76 6 $215,898.76
2 2004 $216,792.09 2 $432,690.85
3 2004 $210,564.40 10 $643,255.25
4 2004 $216,642.01 3 $859,897.26
5 2004 $213,395.19 7 $1,073,292.45
6 2004 $216,412.59 4 $1,289,705.04
7 2004 $206,707.38 12 $1,496,412.42
8 2004 $210,115.92 11 $1,706,528.34
9 2004 $213,034.71 8 $1,919,563.05
10 2004 $211,952.17 9 $2,131,515.22
11 2004 $216,373.17 5 $2,347,888.39
12 2004 $219,498.40 1 $2,567,386.79

As an aside, you really shouldn't use the money type in PostgreSQL. I only used it here to make the formatting a little prettier. In general you should use NUMERIC(X,2) where X is a number larger than you expect to ever have in your database. If you have to deal with multiple countries (or might someday) you should also track the currency code and perhaps even the conversion rate to your native currency code on that particular day. It gets complex in a hurry.