Part of my job is designing helper applications around a set of data in a database designed by a third-party software vendor. The vendor’s database design is … interesting, to say the least; many of the records are versioned, so a common challenge is to find the newest record for a given entity.
This isn’t always as easy as it sounds, because determining “newest” often requires looking at multiple columns. To understand, consider the following (contrived) example:
-- TABLE values
 collection | book | chapter | line | value
------------+------+---------+------+-------
       1234 |    3 |       2 |   34 | fizz
       1234 |    1 |      17 |   23 | blah
       1235 |    2 |      12 |    7 | quuz
       1235 |    7 |       8 |   44 | qaax
       1235 |    7 |       9 |    2 | fooz
In this table, we have values that are found within lines, within chapters, within books, within a collection. This might be entries in some kind of deed book or log, for example. Suppose we want to generate a table containing only the newest value – that is, the value found in the last line of the last chapter of the last book in each collection. Correct results would be:
 collection | value
------------+-------
       1234 | fizz
       1235 | fooz
A first approach
At first blush, this seems simple, and you might be tempted to try something like this:
SELECT collection, value
FROM values v
WHERE book = (SELECT MAX(book) FROM values WHERE collection = v.collection)
  AND chapter = (SELECT MAX(chapter) FROM values WHERE collection = v.collection)
  AND line = (SELECT MAX(line) FROM values WHERE collection = v.collection);
Problem is, this doesn’t work; in fact it returns no rows. Why?
Consider collection 1234:

- MAX(book) is 3
- MAX(chapter) is 17
- MAX(line) is 34

…which means we’re looking for a record with book=3, chapter=17, and line=34. No such record exists.
One possible approach to the problem is a “fractal” approach, in which each subsequent WHERE subclause needs to include the previous ones:
SELECT collection, value
FROM values v
WHERE book = (SELECT MAX(book) FROM values WHERE collection = v.collection)
  AND chapter = (SELECT MAX(chapter) FROM values
                 WHERE collection = v.collection
                   AND book = (SELECT MAX(book) FROM values
                               WHERE collection = v.collection))
  AND line = (SELECT MAX(line) FROM values
              WHERE collection = v.collection
                AND chapter = (SELECT MAX(chapter) FROM values
                               WHERE collection = v.collection
                                 AND book = (SELECT MAX(book) FROM values
                                             WHERE collection = v.collection))
                AND book = (SELECT MAX(book) FROM values
                            WHERE collection = v.collection));
YUCK! What a mess! It does work, but it’s confusing and tedious to write; can you imagine this query if there were four or five fields we needed to order by? Or if values were some complex subquery rather than a table? I’ve had to write queries like this on other database systems where this dataset has lived, and they’re some of the most hideous I’ve ever written. It’s bad enough with a simple example table, but in a real query situation it’s enough to make your head spin.
A different approach
If you only need to find the latest value for a single collection, there’s a much easier trick:
SELECT collection, value
FROM values
WHERE collection = 1234
ORDER BY book DESC, chapter DESC, line DESC
LIMIT 1;
ORDER BY works exactly how we want it to in this case: it orders first by book, then by chapter within books, then by line within chapters. By ordering in descending order and just snipping off the first result, we end up with just the newest value.
Unfortunately, we can’t use this approach to build a table of results for all collections. Or can we?
Enter window functions
Window functions are a little-used feature of Postgresql that allow you to apply aggregate functions over groups of records based on matching values in a column. In other words, if we wanted to tally salaries by department, we could SUM the salaries using a windowing function OVER a PARTITION defined by a department column. We can also tell the window function how to order the records within each department section, so we could (for example) tally the salaries progressively by pay grade.
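As a sketch, that progressive tally might look like the following (the employees table and its department, pay_grade, and salary columns are hypothetical, just for illustration, not part of the dataset above):

```sql
-- Hypothetical employees(department, pay_grade, salary) table.
-- Running total of salaries within each department, ordered by pay grade:
SELECT department, pay_grade, salary,
       SUM(salary) OVER (PARTITION BY department
                         ORDER BY pay_grade) AS running_total
FROM employees;
```

Each department forms its own partition, and the SUM accumulates row by row in pay-grade order within that partition.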
Or, to apply this to our values table, we can create a function to aggregate value over records grouped by collection, in which rows are ordered by book, chapter, and line. It might look like this:
SELECT DISTINCT collection,
       FIRST(value) OVER (PARTITION BY collection
                          ORDER BY book DESC, chapter DESC, line DESC)
FROM values;
What this query does is group (PARTITION) our records by collection, order each group as we specified in our OVER clause, and then run the FIRST aggregate function on the value field in each group (FIRST is an aggregate function that gives us the first record in the series). Unlike a similar GROUP BY operation, this still gives us a row for each row in the original table; but since the aggregated value will be the same for each group, a DISTINCT will reduce us down to one row per collection.
There’s only one real problem with this query: FIRST doesn’t exist (sorry, copy-pasters).
As nice as it would have been, there’s no aggregate function that will return the first record in a series. Is there a way around this?
Enter the ARRAY
Array types in a database may at first just seem like a dangerous tool for thwarting normalization, but there are situations where they come in quite handy. This is one of them. Postgresql has an array_agg() aggregate function that, well, aggregates values into an array. Thus, the query:
SELECT collection, array_agg(value)
FROM values
GROUP BY collection;
Returns:
 collection | array_agg
------------+------------------
       1234 | {fizz,blah}
       1235 | {quuz,qaax,fooz}
Now, by using a window function instead of a GROUP BY, we can control the order in which the values end up in the array:
SELECT collection,
       array_agg(value) OVER (PARTITION BY collection
                              ORDER BY book DESC, chapter DESC, line DESC)
FROM values;
This gives us:
 collection | array_agg
------------+------------------
       1234 | {fizz}
       1234 | {fizz,blah}
       1235 | {fooz}
       1235 | {fooz,qaax}
       1235 | {fooz,qaax,quuz}
Ah, now you can see that the first value in each array is the correct one. We just need to pop off that first value and add DISTINCT so that we only have one record per collection:
SELECT DISTINCT collection,
       (array_agg(value) OVER (PARTITION BY collection
                               ORDER BY book DESC, chapter DESC, line DESC))[1]
FROM values;
Notice two things here:
- Unlike every other array/list/tuple/collection syntax known to programmer-kind, Postgresql arrays start at 1, not 0. So the first value in the array is at index 1.
- We need to put the whole window expression in parentheses and add the subscript to that, because it’s the whole expression (not just array_agg(value)) that evaluates to the array.
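As a quick sanity check on that first point (a minimal sketch, independent of the values table), the 1-based subscripting is easy to see with an array literal:

```sql
-- Postgres array subscripts start at 1:
SELECT (ARRAY['a','b','c'])[1];  -- returns 'a'
```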
This query gives us the correct output, and is considerably more approachable than the fractal approach. Not only that, but adding more ORDER BY columns is trivial.
A better way?
Is there a better way to approach a query like this? If you know of one, please comment. It’d be great to find something as elegant that didn’t rely on Postgresql-specific features like arrays.
Well, dear reader, I hope you aren’t dealing with the sort of hare-brained database designs that I have to deal with; but if you are, maybe they’re a little more manageable now. Cheers!
This can be done without the use of window functions.
SELECT DISTINCT ON (collection) collection, value
FROM values
ORDER BY collection, book DESC, chapter DESC, line DESC;
Or, for the arrays:
SELECT collection, array_agg(value collection, book DESC, chapter DESC, line DESC) FROM values GROUP BY collection;
The second query contains a pretty bad typo. It should have been:
SELECT collection,
       array_agg(value ORDER BY book DESC, chapter DESC, line DESC)
FROM values
GROUP BY collection;
Well, you learn things every day. 🙂 Thanks for telling me about DISTINCT ON.
I changed my query to DISTINCT ON and saw about a 33% reduction in query time. Definitely much better than my method!
The final query is very convoluted. A clearer way would have been:
SELECT
collection,
(array_agg(value ORDER BY book DESC, chapter DESC, line DESC))[1]
FROM values
GROUP BY collection;
But there are better ways to do this; for example, DISTINCT ON (PG-only) and row_number filtering can be used as more efficient alternatives to the array_agg approach.
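A sketch of the row_number approach (standard SQL window functions, run against the same values table) might look like:

```sql
-- Number rows within each collection, newest first, and keep only rank 1:
SELECT collection, value
FROM (
    SELECT collection, value,
           row_number() OVER (PARTITION BY collection
                              ORDER BY book DESC, chapter DESC, line DESC) AS rn
    FROM values
) ranked
WHERE rn = 1;
```

Unlike the array_agg version, this never materializes per-group arrays, and the same pattern works on most databases that support window functions.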
Maybe Postgres’s first_value can do what the hypothetical FIRST operator is supposed to do?

CREATE TABLE values (collection integer,
                     book integer,
                     chapter integer,
                     line integer,
                     value text);
--- CREATE TABLE
INSERT INTO values VALUES (1234, 3, 2, 34, 'fizz'),
(1234, 1, 17, 23, 'blah'),
(1235, 2, 12, 7, 'quuz'),
(1235, 7, 8, 44, 'qaax'),
(1235, 7, 9, 2, 'fooz');
--- INSERT 0 5
SELECT DISTINCT
collection,
first_value(value) OVER (PARTITION BY collection
ORDER BY book DESC, chapter DESC, line DESC)
FROM values;
--- collection | first_value
--- ------------+-------------
--- 1234 | fizz
--- 1235 | fooz
LOL… This is the last time I suggest in a blog that Postgresql lacks a feature. Thanks for telling me about this, will come in handy in the future.