Part of my job is designing helper applications around a set of data in a database designed by a third-party software vendor. The vendor’s database design is … interesting, to say the least; many of the records are versioned, so a common challenge is to find the newest record for a given entity.
This isn’t always as easy as it sounds, because determining “newest” often requires looking at multiple columns. To understand, consider the following (contrived) example:
-- TABLE values
 collection | book | chapter | line | value
------------+------+---------+------+-------
       1234 |    3 |       2 |   34 | fizz
       1234 |    1 |      17 |   23 | blah
       1235 |    2 |      12 |    7 | quuz
       1235 |    7 |       8 |   44 | qaax
       1235 |    7 |       9 |    2 | fooz
In this table, we have values that are found within lines, within chapters, within books, within a collection. This might be entries in some kind of deed book or log, for example. Suppose we want to generate a table containing only the newest value – that is, the value found in the last line of the last chapter of the last book in each collection. Correct results would be:
 collection | value
------------+-------
       1234 | fizz
       1235 | fooz
A first approach
At first blush, this seems simple, and you might be tempted to try something like this:
SELECT collection, value
FROM values v
WHERE book = (SELECT MAX(book) FROM values WHERE collection = v.collection)
  AND chapter = (SELECT MAX(chapter) FROM values WHERE collection = v.collection)
  AND line = (SELECT MAX(line) FROM values WHERE collection = v.collection);
Problem is, this doesn’t work; in fact it returns no rows. Why?
Consider collection 1234:

- MAX(book) is 3
- MAX(chapter) is 17
- MAX(line) is 34

…which means we’re looking for a record with book=3, chapter=17, and line=34. No such record exists.
One possible approach to the problem is a “fractal” approach, in which each subsequent WHERE subclause needs to include the previous ones:
SELECT collection, value
FROM values v
WHERE book = (SELECT MAX(book) FROM values WHERE collection = v.collection)
  AND chapter = (SELECT MAX(chapter) FROM values
                 WHERE collection = v.collection
                   AND book = (SELECT MAX(book) FROM values
                               WHERE collection = v.collection))
  AND line = (SELECT MAX(line) FROM values
              WHERE collection = v.collection
                AND chapter = (SELECT MAX(chapter) FROM values
                               WHERE collection = v.collection
                                 AND book = (SELECT MAX(book) FROM values
                                             WHERE collection = v.collection))
                AND book = (SELECT MAX(book) FROM values
                            WHERE collection = v.collection));
YUCK! What a mess! It does work, but it’s confusing and tedious to write; can you imagine this query if there were four or five fields we needed to order by? Or if values were some complex subquery rather than a table? I’ve had to write queries like this on other database systems where this dataset has lived, and they’re some of the most hideous I’ve ever written. It’s bad enough with a simple example table, but in a real query situation it’s enough to make your head spin.
A different approach
If you only need to find the latest value for a single collection, there’s a much easier trick:
SELECT collection, value
FROM values
WHERE collection = 1234
ORDER BY book DESC, chapter DESC, line DESC
LIMIT 1;
ORDER BY works exactly how we want it to in this case: it orders first by book, then by chapter within books, then by line within chapters. By ordering in descending order and just snipping off the first result, we end up with just the newest value.
Unfortunately, we can’t use this approach to build a table of results for all collections. Or can we?
Enter window functions
Window functions are a little-used feature of Postgresql that allow you to apply aggregate functions over groups of records based on matching values in a column. In other words, if we wanted to tally salaries by department, we could SUM the salaries using a windowing function OVER a PARTITION defined by a department column. We can also tell the window function how to order the records within each department section, so we could (for example) tally the salaries progressively by pay grade.
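As a sketch, that progressive tally might look like the following (the employees table and its department, pay_grade, and salary columns are hypothetical, just for illustration, not part of the dataset above):

```sql
-- Hypothetical employees(department, pay_grade, salary) table.
-- Running total of salaries within each department, ordered by pay grade:
SELECT department, pay_grade, salary,
       SUM(salary) OVER (PARTITION BY department
                         ORDER BY pay_grade) AS running_total
FROM employees;
```

Each department forms its own partition, and the SUM accumulates row by row in pay-grade order within that partition.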
Or, to apply this to our values table, we can create a function to aggregate value over records grouped by collection, in which rows are ordered by book, chapter, and line. It might look like this:
SELECT DISTINCT collection,
       FIRST(value) OVER (PARTITION BY collection
                          ORDER BY book DESC, chapter DESC, line DESC)
FROM values;
What this query does is group (PARTITION) our records by collection, order each group as we specified in our OVER clause, and then run the FIRST aggregate function on the value field in each group (FIRST is an aggregate function that gives us the first record in the series). Unlike a similar GROUP BY operation, this still gives us a row for each row in the original table; but since the aggregated value will be the same for each group, a DISTINCT will reduce us down to one row per collection.
There’s only one real problem with this query: FIRST doesn’t exist (sorry, copy-pasters).
As nice as it would have been, there’s no aggregate function that will return the first record in a series. Is there a way around this?
Enter the ARRAY
Array types in a database may at first just seem like a dangerous tool for thwarting normalization, but there are situations where they come in quite handy. This is one of them. Postgresql has an array_agg() aggregate function that, well, aggregates values into an array. Thus, the query:
SELECT collection, array_agg(value)
FROM values
GROUP BY collection;
Returns:
 collection | array_agg
------------+------------------
       1234 | {fizz,blah}
       1235 | {quuz,qaax,fooz}
Now, by using a window function instead of a GROUP BY, we can control the order in which the values end up in the array:
SELECT collection,
       array_agg(value) OVER (PARTITION BY collection
                              ORDER BY book DESC, chapter DESC, line DESC)
FROM values;
This gives us:
 collection | array_agg
------------+------------------
       1234 | {fizz}
       1234 | {fizz,blah}
       1235 | {fooz}
       1235 | {fooz,qaax}
       1235 | {fooz,qaax,quuz}
Ah, now you can see that the first value in each array is the correct one. We just need to pop off that first value and add DISTINCT so that we only have one record per collection:
SELECT DISTINCT collection,
       (array_agg(value) OVER (PARTITION BY collection
                               ORDER BY book DESC, chapter DESC, line DESC))[1]
FROM values;
Notice two things here:
- Unlike every other array/list/tuple/collection syntax known to programmer-kind, Postgresql arrays start at 1, not 0. So the first value in the array is at index 1.
- We need to put the whole window expression in parentheses and add the subscript to that, because it’s the whole expression (not just array_agg(value)) that evaluates to the array.
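As a quick sanity check on that first point (a minimal sketch, independent of the values table), the 1-based subscripting is easy to see with an array literal:

```sql
-- Postgres array subscripts start at 1:
SELECT (ARRAY['a','b','c'])[1];  -- returns 'a'
```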
This query gives us the correct output, and is considerably more approachable than the fractal approach. Not only that, but adding more ORDER BY columns is trivial.
A better way?
Is there a better way to approach a query like this? If you know of one, please comment. It’d be great to find something as elegant that didn’t rely on Postgresql-specific features like arrays.
Well, dear reader, I hope you aren’t dealing with the sort of hare-brained database designs that I have to deal with; but if you are, maybe they’re a little more manageable now. Cheers!
This can be done without the use of window functions.
SELECT DISTINCT ON (collection) collection, value
FROM values
ORDER BY collection, book DESC, chapter DESC, line DESC;
Or, for the arrays:
SELECT collection, array_agg(value collection, book DESC, chapter DESC, line DESC) FROM values GROUP BY collection;
The second query contains a pretty bad typo. It should have been:
SELECT collection,
       array_agg(value ORDER BY book DESC, chapter DESC, line DESC)
FROM values
GROUP BY collection;
Well, you learn things every day. 🙂 Thanks for telling me about DISTINCT ON.
I changed my query to DISTINCT ON and saw about a 33% reduction in query time. Definitely much better than my method!
The final query is very convoluted. A clearer way would have been:
SELECT
collection,
(array_agg(value ORDER BY book DESC, chapter DESC, line DESC))[1]
FROM values
GROUP BY collection;
But there are better ways to do this; for example, DISTINCT ON (PG-only) and row_number filtering can be used as more efficient alternatives to the array_agg approach.
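A sketch of the row_number approach (standard SQL window functions, run against the same values table) might look like:

```sql
-- Number rows within each collection, newest first, and keep only rank 1:
SELECT collection, value
FROM (
    SELECT collection, value,
           row_number() OVER (PARTITION BY collection
                              ORDER BY book DESC, chapter DESC, line DESC) AS rn
    FROM values
) ranked
WHERE rn = 1;
```

Unlike the array_agg version, this never materializes per-group arrays, and the same pattern works on most databases that support window functions.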
Maybe Postgres’s first_value can do what the hypothetical FIRST operator is supposed to do?

CREATE TABLE values (collection integer,
                     book integer,
                     chapter integer,
                     line integer,
                     value text);
--- CREATE TABLE
INSERT INTO values VALUES (1234, 3, 2, 34, 'fizz'),
(1234, 1, 17, 23, 'blah'),
(1235, 2, 12, 7, 'quuz'),
(1235, 7, 8, 44, 'qaax'),
(1235, 7, 9, 2, 'fooz');
--- INSERT 0 5
SELECT DISTINCT
collection,
first_value(value) OVER (PARTITION BY collection
ORDER BY book DESC, chapter DESC, line DESC)
FROM values;
--- collection | first_value
--- ------------+-------------
--- 1234 | fizz
--- 1235 | fooz
LOL… This is the last time I suggest in a blog that Postgresql lacks a feature. Thanks for telling me about this, will come in handy in the future.