Understanding Window Functions

There was SQL before window functions and SQL after window functions: that’s how powerful this tool is. Being that of a deal breaker unfortunately means that it can be quite hard to grasp the feature. This article aims at making it crystal clear so that you can begin using it today and are able to reason about it and recognize cases where you want to be using window functions.

*We see a part of the data as if through a little window*

The whole idea behind window functions is to allow you to process several values of the result set at a time: you see through the window some peer rows and are able to compute a single output value from them, much like when using an aggregate function.

It’s all about frames

Partitioning into different frames

Available window functions

Conclusion

It’s all about frames

PostgreSQL comes with plenty of features, one of them will be of great help here to get a better grasp at what’s happening with window functions. The first step we are going through here is understanding which data the function has access to. For each input row you have access to a frame of the data, and the first thing to understand here is that frame.

First, meet with array_agg, an aggregate function that will build an array for you. Let’s use this tool to understand window frames:

 select x, array_agg(x) over (order by x)
   from generate_series(1, 3) as t(x);

 x | array_agg 
---+-----------
 1 | {1}
 2 | {1,2}
 3 | {1,2,3}
(3 rows)

The array_agg column in the previous query output allows us to see the full exact content of the windowing we’re going to process. The window definition here is over (order by x) and actually means over (order by x rows between unbounded preceding and current row):

 select x,
        array_agg(x) over (order by x
                           rows between unbounded preceding
                                    and current row)
   from generate_series(1, 3) as t(x);

 x | array_agg 
---+-----------
 1 | {1}
 2 | {1,2}
 3 | {1,2,3}
(3 rows)

It’s possible to work with other kind of frame specifications too, as in the following examples:

select x,
       array_agg(x) over (rows between current row
                                   and unbounded following)
  from generate_series(1, 3) as t(x);

 x | array_agg 
---+-----------
 1 | {1,2,3}
 2 | {2,3}
 3 | {3}
(3 rows)

If no frame clause is used at all, then the default is to see the whole set of rows in each of them, which can be really useful if you want to compute sums and percentages for example:

select x,
       array_agg(x) over () as frame,
       sum(x) over () as sum,
       x::float/sum(x) over () as part
  from generate_series(1, 3) as t(x);

 x |  frame  | sum |       part        
---+---------+-----+-------------------
 1 | {1,2,3} |   6 | 0.166666666666667
 2 | {1,2,3} |   6 | 0.333333333333333
 3 | {1,2,3} |   6 |               0.5
(3 rows)

Did you know you could compute both the total sum of a column and the proportion of the current value against this total within a single SQL query? That’s the breakthrough we’re talking about now with window functions.

Partitioning into different frames

Other frames are possible to define when using the clause PARTITION BY. To understand the partition by window it is best to relate to a real world use case. Let’s use the historical record of motor racing data available publicly.

The database is available in a single download file for MySQL only. Once you have a local copy, use pgloader to have the data set in PostgreSQL, it’s a single command line (once you have created a f1db database):

$ createdb f1db
$ pgloader mysql://root@localhost/f1db pgsql:///f1db

Also you might want to have f1db in your search_path, which you can do with the following command:

ALTER DATABASE f1db SET search_path TO f1db, public;

If you want to try the query at home and modify it, consider fetching and loading the same data set.

So in the Ergast database we have a results table with results from all the known races. We pick a race that happened just before this article was first published:

-[ RECORD 1 ]-----------------------------------------------------
raceid    | 890
year      | 2013
round     | 10
circuitid | 11
name      | Hungarian Grand Prix
date      | 2013-07-28
time      | 12:00:00
url       | http://en.wikipedia.org/wiki/2013_Hungarian_Grand_Prix

And within that race we can now fetch the list of competing drivers in their position order (winner first), and also their ranking compared to other drivers from the same constructor in the race:

select surname,
         constructors.name,
         position,
         format('%s / %s',
                row_number()
                  over(partition by constructorid
                           order by position nulls last),

                count(*) over(partition by constructorid)
               )
            as "pos same constr"
    from      results
         join drivers using(driverid)
         join constructors using(constructorid)
   where raceid = 890
order by position;

The partition by frame allow us to see peer rows, here the rows from results where the constructorid is the same as the current row. We use that partition twice in the previous SQL query, in the format() call. The first time with the row_number() window function gives us the position in the race with respect to other drivers from the same constructor, and the second time with count(*) gives us how many drivers from the same constructor participated in the race:

    surname    |    name     | position | pos same constr 
---------------+-------------+----------+-----------------
 Hamilton      | Mercedes    |        1 | 1 / 2
 Räikkönen     | Lotus F1    |        2 | 1 / 2
 Vettel        | Red Bull    |        3 | 1 / 2
 Webber        | Red Bull    |        4 | 2 / 2
 Alonso        | Ferrari     |        5 | 1 / 2
 Grosjean      | Lotus F1    |        6 | 2 / 2
 Button        | McLaren     |        7 | 1 / 2
 Massa         | Ferrari     |        8 | 2 / 2
 Pérez         | McLaren     |        9 | 2 / 2
 Maldonado     | Williams    |       10 | 1 / 2
 Hülkenberg    | Sauber      |       11 | 1 / 2
 Vergne        | Toro Rosso  |       12 | 1 / 2
 Ricciardo     | Toro Rosso  |       13 | 2 / 2
 van der Garde | Caterham    |       14 | 1 / 2
 Pic           | Caterham    |       15 | 2 / 2
 Bianchi       | Marussia    |       16 | 1 / 2
 Chilton       | Marussia    |       17 | 2 / 2
 di Resta      | Force India |       18 | 1 / 2
 Rosberg       | Mercedes    |       19 | 2 / 2
 Bottas        | Williams    |        ⦱ | 2 / 2
 Sutil         | Force India |        ⦱ | 2 / 2
 Gutiérrez     | Sauber      |        ⦱ | 2 / 2
(22 rows)

Drivers who didn’t finish the race get a null position entry, that our psql setup displays as the ⦱ character, for convenience.

In a single SQL query we can obtain information from each driver in the race and add to that other information from the race as a whole. Remember that the window functions only happens after the where clause, so you only get to see rows from the available result set of the query.

Available window functions

Any and all aggregate function you already know can be used against a window frame rather than a grouping clause, so you can already go use sum, min, max, count, avg and the other you’re used to.

You might already know that it’s possible with PostgreSQL to use the CREATE AGGREGATE command to register your own custom aggregate. Any such custom aggregate can then be given a window frame definition to work against too.

As an exercize to the reader, implement a weighted average aggregate and use it against a table where you have at least three columns: a date, a weight and a measure, with several measures per day. Now compute your weighted average by applying your own aggregate to your data set, either in a grouping clause or a window frame.

PostgreSQL of course is included with built-in aggregate functions and a number of built-in window functions.

  select surname,
         position,
         row_number()
           over(order by fastestlapspeed::numeric)
           as "fastest",
         ntile(3) over w as "group",
         lag(surname, 1) over w as "previous",
         lead(surname, 1) over w as "next"
    from      results
         join drivers using(driverid)
   where raceid = 890
  window w as (order by position)
order by position;

In this example you can see that we are reusing the same window definition several times, so we’re giving it a name to simplify the SQL. In this query we are fetching for each driver its position in the results, its position in terms of fastest lap speed, a group number if we divide the drivers into a set of 4 groups thanks to the ntile function, the name of the previous driver who made it, and the name of the driver immediately next to the current one, thanks to the lag an lead functions:

    surname    | position | fastest | group |   previous    |     next      
---------------+----------+---------+-------+---------------+---------------
 Hamilton      |        1 |      20 |     1 | ⦱             | Räikkönen
 Räikkönen     |        2 |      17 |     1 | Hamilton      | Vettel
 Vettel        |        3 |      21 |     1 | Räikkönen     | Webber
 Webber        |        4 |      22 |     1 | Vettel        | Alonso
 Alonso        |        5 |      15 |     1 | Webber        | Grosjean
 Grosjean      |        6 |      16 |     1 | Alonso        | Button
 Button        |        7 |      12 |     1 | Grosjean      | Massa
 Massa         |        8 |      18 |     1 | Button        | Pérez
 Pérez         |        9 |      13 |     2 | Massa         | Maldonado
 Maldonado     |       10 |      14 |     2 | Pérez         | Hülkenberg
 Hülkenberg    |       11 |       9 |     2 | Maldonado     | Vergne
 Vergne        |       12 |      11 |     2 | Hülkenberg    | Ricciardo
 Ricciardo     |       13 |       8 |     2 | Vergne        | van der Garde
 van der Garde |       14 |       6 |     2 | Ricciardo     | Pic
 Pic           |       15 |       5 |     2 | van der Garde | Bianchi
 Bianchi       |       16 |       3 |     3 | Pic           | Chilton
 Chilton       |       17 |       4 |     3 | Bianchi       | di Resta
 di Resta      |       18 |      10 |     3 | Chilton       | Rosberg
 Rosberg       |       19 |      19 |     3 | di Resta      | Bottas
 Sutil         |        ⦱ |       2 |     3 | Gutiérrez     | ⦱
 Gutiérrez     |        ⦱ |       1 |     3 | Bottas        | Sutil
 Bottas        |        ⦱ |       7 |     3 | Rosberg       | Gutiérrez
(22 rows)

And we can see that the fastest lap speed is not as important as one could think, as both the two fastest drivers didn’t even finish the race. In SQL terms we also see that we can have two different orderings returned from the same query, and again we can poke at other rows.

Conclusion

PostgreSQL is YeSQL

The real magic of what’s called window functions is actually the frame of data they can see when using the OVER () clause and its PARTITION BY and ORDER BY and frame clauses.

You need to remember that the windowing clauses are always considered last in the query, meaning after the WHERE clause. You can only see in any frame rows that have been selected for output: e.g. it’s not directly possible to compute a percentage over values that you don’t want to display. You would need to use a subquery in that case.

For more concrete examples about the window functions usage, you can see some other of my blog posts, such as Make the Most ouf of SQL and Reset Counter that I just tagged as Window Functions Articles.

Understanding Window Functions

Table of Contents

It’s all about frames

Partitioning into different frames

Available window functions

Conclusion

Dimitri Fontaine