# TIL: Finding a row with the max of multiple columns in Postgresql

Part of my job is designing helper applications around a set of data in a database designed by a third-party software vendor. The vendor’s database design is … interesting, to say the least; many of the records are versioned, so a common challenge is to find the newest record for a given entity.

This isn’t always as easy as it sounds, because determining “newest” often requires looking at multiple columns. To understand, consider the following (contrived) example:

```
-- TABLE values

 collection | book | chapter | line | value
------------+------+---------+------+-------
       1234 |    3 |       2 |   34 | fizz
       1234 |    1 |      17 |   23 | blah
       1235 |    2 |      12 |    7 | quuz
       1235 |    7 |       8 |   44 | qaax
       1235 |    7 |       9 |    2 | fooz
```

In this table, we have values that are found within lines, within chapters, within books, within a collection. This might be entries in some kind of deed book or log, for example. Suppose we want to generate a table containing only the newest value – that is, the value found in the last line of the last chapter of the last book in each collection. Correct results would be:

```
 collection | value
------------+-------
       1234 | fizz
       1235 | fooz
```

## A first approach

At first blush, this seems simple, and you might be tempted to try something like this:

```sql
SELECT collection, value
FROM values v
WHERE book = (SELECT MAX(book) FROM values WHERE collection = v.collection) AND
      chapter = (SELECT MAX(chapter) FROM values WHERE collection = v.collection) AND
      line = (SELECT MAX(line) FROM values WHERE collection = v.collection);
```

Problem is, this doesn’t work; in fact it returns no rows. Why?

Consider collection 1234:

• `MAX(book)` is 3
• `MAX(chapter)` is 17
• `MAX(line)` is 34

…which means we’re looking for a record with `book=3`, `chapter=17`, and `line=34`. No such record exists.

One possible approach to the problem is a “fractal” approach, in which each subsequent `WHERE` subclause needs to include the previous ones:

```sql
SELECT collection, value
FROM values v
WHERE book = (SELECT MAX(book) FROM values WHERE collection = v.collection)
  AND chapter = (SELECT MAX(chapter)
                 FROM values WHERE collection = v.collection
                   AND book = (SELECT MAX(book)
                               FROM values WHERE collection = v.collection))
  AND line = (SELECT MAX(line)
              FROM values WHERE collection = v.collection
                AND chapter = (SELECT MAX(chapter)
                               FROM values WHERE collection = v.collection
                                 AND book = (SELECT MAX(book)
                                             FROM values
                                             WHERE collection = v.collection))
                AND book = (SELECT MAX(book)
                            FROM values
                            WHERE collection = v.collection));
```

YUCK! What a mess! It does work, but it’s confusing and tedious to write; can you imagine this query if there were four or five fields we need to order? Or if `values` was some complex subquery rather than a table? I’ve had to write queries like this on other database systems where this dataset has lived, and they’re some of the most hideous I’ve ever written. It’s bad enough with a simple example table, but in a real query situation it’s enough to make your head spin.

## A different approach

If you only need to find the latest value for a single collection, there’s a much easier trick:

```sql
SELECT collection, value
FROM values
WHERE collection = 1234
ORDER BY book DESC, chapter DESC, line DESC
LIMIT 1;
```

`ORDER BY` works exactly how we want it to in this case: it orders first by `book`, then by `chapter` within each book, then by `line` within each chapter. By ordering in descending order and snipping off the first result, we end up with just the newest `value`.

Unfortunately, we can’t use this approach to build a table of results for all collections. Or can we?

### Enter window functions

Window functions are a little-used feature of Postgresql that allow you to apply aggregate functions over groups of records based on matching values in a column. In other words, if we wanted to tally salaries by department, we could `SUM` the salaries using a window function `OVER` a `PARTITION` defined by a department column. We can also tell the window function how to order the records within each department’s partition, so we could (for example) tally the salaries progressively by pay grade.
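As a rough sketch of that salary example (the `employees` table and its columns are hypothetical, invented here for illustration):

```sql
-- Hypothetical table: employees(department text, pay_grade integer, salary numeric)

-- Total salary per department, repeated on every row of that department:
SELECT department, salary,
       SUM(salary) OVER (PARTITION BY department) AS department_total
FROM employees;

-- Adding ORDER BY to the window gives a running total,
-- accumulated in pay-grade order within each department:
SELECT department, pay_grade, salary,
       SUM(salary) OVER (PARTITION BY department ORDER BY pay_grade) AS running_total
FROM employees;
```

Note the difference from `GROUP BY`: each original row survives, with the aggregate computed over its partition attached alongside.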

Or, to apply this to our `values` table, we can create a function to aggregate `value` over records grouped by `collection` in which rows are ordered by `book`, `chapter`, and `line`. It might look like this:

```sql
SELECT DISTINCT collection,
       FIRST(value) OVER (PARTITION BY collection
                          ORDER BY book DESC, chapter DESC, line DESC)
FROM values;
```

What this query does is group (PARTITION) our records by collection, order each group as we specified in our `OVER` clause, and then run the `FIRST` aggregate function on the `value` field in each group (`FIRST` is an aggregate function that gives us the first record in the series). Unlike a similar `GROUP BY` operation, this still gives us a row for each row in the original table; but since the aggregated value will be the same for each group, a `DISTINCT` will reduce us down to one row per `collection`.

There’s only one real problem with this query: `FIRST` doesn’t exist (sorry copy-pasters).

As nice as it would have been, there’s no aggregate function that will return the first record in a series. Is there a way around this?

### Enter the ARRAY

Array types in a database may at first just seem like a dangerous tool for thwarting normalization, but there are situations where they come in quite handy. This is one of them. Postgresql has an `array_agg()` aggregate function that, well, aggregates values into an array. Thus, the query:

```sql
SELECT collection, array_agg(value) FROM values GROUP BY collection;
```

Returns:

```
 collection |    array_agg
------------+------------------
       1234 | {fizz,blah}
       1235 | {quuz,qaax,fooz}
```

Now, by using a window function instead of a `GROUP BY`, we can control the order that the values end up in the array:

```sql
SELECT collection,
       array_agg(value)
         OVER (PARTITION BY collection
               ORDER BY book DESC, chapter DESC, line DESC)
FROM values;
```

This gives us:

```
 collection |    array_agg
------------+------------------
       1234 | {fizz}
       1234 | {fizz,blah}
       1235 | {fooz}
       1235 | {fooz,qaax}
       1235 | {fooz,qaax,quuz}
```

Ah, now you can see that the first value in each array is the correct one. We just need to pop off that first value and add `DISTINCT` so that we only have one record per collection:

```sql
SELECT DISTINCT collection,
       (
         array_agg(value)
           OVER (PARTITION BY collection
                 ORDER BY book DESC, chapter DESC, line DESC)
       )[1]
FROM values;
```

Notice two things here:

• Unlike every other array/list/tuple/collection syntax known to programmer-kind, Postgresql arrays start at 1, not 0. So the first value in the array is at subscript `[1]`.
• We need to put the whole window expression in parentheses and add the subscript to that, because it’s the whole expression (not just `array_agg(value)`) that evaluates to the array.
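Both points can be seen in miniature by subscripting an array literal directly (a standalone illustration, separate from the query above — the parentheses around the expression are required before the subscript can be applied):

```sql
SELECT (ARRAY['fizz','blah'])[1];  -- returns fizz, because subscripts start at 1
```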

This query gives us the correct output, and is considerably more approachable than the fractal approach. Not only that, but adding more `ORDER BY` columns is trivial.

## A better way?

Is there a better way to approach a query like this? If you know of one, please comment. It’d be great to find something as elegant that didn’t rely on Postgresql-specific features like arrays.

Well, dear reader, I hope you aren’t dealing with the sort of hare-brained database designs that I have to deal with; but if you are, maybe they’re a little more manageable now. Cheers!

### 7 Thoughts on “TIL: Finding a row with the max of multiple columns in Postgresql”

1. Andreas says:

This can be done without the use of window functions.

```sql
SELECT DISTINCT ON (collection) value FROM values ORDER BY collection, book DESC, chapter DESC, line DESC;
```

Or, to get the arrays:

```sql
SELECT collection, array_agg(value collection, book DESC, chapter DESC, line DESC) FROM values GROUP BY collection;
```

1. Andreas says:

The second query contains a pretty bad typo. It should have been:

```sql
SELECT collection, array_agg(value ORDER BY book DESC, chapter DESC, line DESC) FROM values GROUP BY collection;
```

2. Alan says:

Well, you learn things every day. 🙂 Thanks for telling me about DISTINCT ON.

3. Alan says:

I changed my query to DISTINCT ON and saw about a 33% reduction in query time. Definitely much better than my method!

2. Rukh says:

The final query is very convoluted. A clearer way would have been:

```sql
SELECT
  collection,
  (array_agg(value ORDER BY book DESC, chapter DESC, line DESC))[1]
FROM values
GROUP BY collection;
```

But there are better ways to do this; for example, `DISTINCT ON` (Postgresql-only) and `row_number()` filtering can be used as more efficient alternatives to the `array_agg` approach.

3. Jason Dusek says:

Maybe Postgres’s `first_value` can do what the hypothetical `FIRST` operator is supposed to do?

```sql
CREATE TABLE values (collection integer, book integer, chapter integer,
                     line integer, value text);
--- CREATE TABLE
INSERT INTO values VALUES
  (1234, 3, 2, 34, 'fizz'),
  (1234, 1, 17, 23, 'blah'),
  (1235, 2, 12, 7, 'quuz'),
  (1235, 7, 8, 44, 'qaax'),
  (1235, 7, 9, 2, 'fooz');
--- INSERT 0 5
SELECT DISTINCT collection,
       first_value(value) OVER (PARTITION BY collection
                                ORDER BY book DESC, chapter DESC, line DESC)
FROM values;
---  collection | first_value
--- ------------+-------------
---        1234 | fizz
---        1235 | fooz
```

1. Alan says:

LOL… This is the last time I suggest in a blog that Postgresql lacks a feature. Thanks for telling me about this, will come in handy in the future.