{"id":1031,"date":"2014-07-02T19:46:00","date_gmt":"2014-07-03T00:46:00","guid":{"rendered":"http:\/\/www.alandmoore.com\/blog\/?p=1031"},"modified":"2014-07-03T00:11:17","modified_gmt":"2014-07-03T05:11:17","slug":"til-finding-a-row-with-the-max-of-multiple-columns-in-postgresql","status":"publish","type":"post","link":"https:\/\/alandmoore.com\/blog\/2014\/07\/02\/til-finding-a-row-with-the-max-of-multiple-columns-in-postgresql\/","title":{"rendered":"TIL: Finding a row with the max of multiple columns in Postgresql"},"content":{"rendered":"<p> Part of my job is desigining helper applications around a set of data in a database designed by a third-party software vendor.  The vendor&#8217;s database design is &#x2026; interesting to say the least; many of the records are <i>versioned<\/i>, so a common challenge is to find the newest record for a given entity.<br \/>\n<!--more--><\/p>\n<p>This isn&#8217;t always as easy as it sounds, because determining &#8220;newest&#8221; often requires looking at multiple columns.  To understand, consider the following (contrived) example: <\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\n-- TABLE values\r\n\r\n collection | book | chapter | line | value \r\n------------+------+---------+------+-------\r\n       1234 |    3 |       2 |   34 | fizz\r\n       1234 |    1 |      17 |   23 | blah\r\n       1235 |    2 |      12 |    7 | quuz\r\n       1235 |    7 |       8 |   44 | qaax\r\n       1235 |    7 |       9 |    2 | fooz\r\n<\/pre>\n<p> In this table, we have values that are found within lines, within chapters, within books, within a collection.  This might be entries in some kind of deed book or log, for example.  Suppose we want to generate a table containing only the newest value &#x2013; that is, the value found in the last line of the last chapter of the last book in each collection.  Correct results would be: <\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\n collection | value \r\n------------+-------\r\n       1234 | fizz\r\n       1235 | fooz\r\n<\/pre>\n<div id=\"outline-container-sec-1\" class=\"outline-2\">\n<h2 id=\"sec-1\">A first approach<\/h2>\n<div class=\"outline-text-2\" id=\"text-1\">\n<p> At first blush, this seems simple, and you might be tempted to try something like this: <\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nSELECT collection, value \r\nFROM values v  \r\nWHERE book=(SELECT MAX(book) FROM values WHERE collection = v.collection) AND \r\n      chapter=(SELECT MAX(chapter) FROM values WHERE collection = v.collection) AND\r\n      line = (SELECT MAX(line) FROM values WHERE collection = v.collection)\r\n<\/pre>\n<p> Problem is, this doesn&#8217;t work; in fact it returns no rows.  Why? <\/p>\n<p> Consider collection 1234: <\/p>\n<ul class=\"org-ul\">\n<li><code>MAX(book)<\/code> is 3\n<\/li>\n<li><code>MAX(chapter)<\/code> is 17\n<\/li>\n<li><code>MAX(line)<\/code> is 34\n<\/li>\n<\/ul>\n<p> &#x2026;which means we&#8217;re looking for a record with <code>book=1<\/code>, <code>chapter=17<\/code>, and <code>line=34<\/code>.  No such record exists. <\/p>\n<p> One possible approach to the problem is a &#8220;fractal&#8221; approach, in which each subsequent <code>WHERE<\/code> subclause needs to include the previous ones: <\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nSELECT collection, value \r\nFROM values v  \r\nWHERE book=(SELECT MAX(book) FROM values WHERE collection = v.collection) \r\n      AND chapter=(SELECT MAX(chapter) \r\n\t\t   FROM values WHERE collection = v.collection \r\n\t\t   AND book=(SELECT MAX(book) \r\n\t\t\t     FROM values WHERE collection = v.collection)) \r\n      AND line = (SELECT MAX(line) \r\n\t\t  FROM values WHERE collection = v.collection \r\n\t\t    AND chapter = (SELECT MAX(chapter) \r\n\t\t\t\t   FROM values WHERE collection = v.collection \r\n\t\t\t\t    AND book=(SELECT MAX(book) \r\n\t\t\t\t\t      FROM values \r\n\t\t\t\t\t      WHERE collection = v.collection))\r\n\t\t\t\t   AND book = (SELECT MAX(book) \r\n\t\t\t\t\t       FROM values \r\n\t\t\t\t\t       WHERE collection= v.collection))\r\n<\/pre>\n<p> YUCK!  What a mess!  It <i>does<\/i> work, but it&#8217;s confusing and tedious to write; can you imagine this query if there were four or five fields we need to order?  Or if <code>values<\/code> was some complex subquery rather than a table?  I&#8217;ve had to write queries like this on other database systems where this dataset has lived, and they&#8217;re some of the most hideous I&#8217;ve ever written.  It&#8217;s bad enough with a simple example table, but in a real query situation it&#8217;s enough to make your head spin. <\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-sec-2\" class=\"outline-2\">\n<h2 id=\"sec-2\">A different approach<\/h2>\n<div class=\"outline-text-2\" id=\"text-2\">\n<p> If you only need to find the latest value for a <i>single collection<\/i>, there&#8217;s a much easier trick: <\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nSELECT collection, value \r\nFROM values \r\nWHERE collection=1234\r\nORDER BY book DESC, chapter DESC, line DESC \r\nLIMIT 1;\r\n<\/pre>\n<p> <code>ORDER BY<\/code> works exactly how we want it to in this case:  it orders first by <code>book<\/code>, then by <code>chapters<\/code> within <code>books<\/code>, then by <code>lines<\/code> within <code>chapters<\/code>.  By ordering in descending order and just snipping off the first result, we end up with just the newest <code>value<\/code>. <\/p>\n<p> Unfortunately, we can&#8217;t use this approach to build a table of results for all collections.   Or can we? <\/p>\n<\/div>\n<div id=\"outline-container-sec-2-1\" class=\"outline-3\">\n<h3 id=\"sec-2-1\">Enter window functions<\/h3>\n<div class=\"outline-text-3\" id=\"text-2-1\">\n<p> Window functions<sup><a id=\"fnr.1\" name=\"fnr.1\" class=\"footref\" href=\"#fn.1\">1<\/a><\/sup> are a little-used feature of Postgresql that allow you to apply aggregate functions over groups of records based on matching values in a column.  In other words, if we wanted to tally a department&#8217;s salaries by department, we could <code>SUM<\/code> the salaries using a windowing function <code>OVER<\/code> a <code>PARTITION<\/code> defined by a department column.  We can also tell the window function how to order the records within each department section, so we could (for example) tally the salaries progressively by pay grade. <\/p>\n<p> Or, to apply this to our <code>values<\/code> table, we can create a function to aggregate <code>value<\/code> over records grouped by <code>collection<\/code> in which rows are ordered by <code>book<\/code>, <code>chapter<\/code>, and <code>line<\/code>.  It might look like this: <\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nSELECT DISTINCT collection, \r\n       FIRST(value) OVER (PARTITION BY collection \r\n\t\t\t  ORDER BY book DESC, chapter DESC, line DESC) \r\nFROM values;\r\n<\/pre>\n<p> What this query does is group (PARTITION) our records by collection, order each group as we specified in our <code>OVER<\/code> clause, and then run the <code>FIRST<\/code> aggregate function on the <code>value<\/code> field in each group (<code>FIRST<\/code> is an aggregate function that gives us the first record in the series).  Unlike a similar <code>GROUP BY<\/code> operation, this still gives us a row for each row in the original table; but since the aggregated value will be the same for each group, a <code>DISTINCT<\/code> will reduce us down to one row per <code>collection<\/code>. <\/p>\n<p> There&#8217;s only one real problem with this query:  <code>FIRST<\/code> doesn&#8217;t exist (sorry copy-pasters). <\/p>\n<p> As nice as it would have been, there&#8217;s no aggregate function that will return the first record in a series.  Is there a way around this? <\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-sec-2-2\" class=\"outline-3\">\n<h3 id=\"sec-2-2\">Enter the ARRAY<\/h3>\n<div class=\"outline-text-3\" id=\"text-2-2\">\n<p> Array types in a database may at first just seem like a dangerous tool for thwarting normalization, but there are situations where they come in quite handy.  This is one of them.  Postgresql has an <code>array_agg()<\/code> aggregate function that, well, aggregates values into an array.  Thus, the query: <\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nSELECT collection, array_agg(value) FROM values GROUP BY collection;\r\n<\/pre>\n<p> Returns: <\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\n collection |    array_agg     \r\n------------+------------------\r\n       1234 | {fizz,blah}\r\n       1235 | {quuz,qaax,fooz}\r\n<\/pre>\n<p> Now, by using a window function instead of a <code>GROUP BY<\/code>, we can control the order that the values end up in the array: <\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nSELECT collection, \r\n       array_agg(value) \r\n\t OVER (PARTITION BY collection \r\n\t       ORDER BY book DESC, chapter DESC, line DESC) \r\nFROM values;\r\n<\/pre>\n<p> This gives us: <\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\n collection |    array_agg     \r\n------------+------------------\r\n       1234 | {fizz}\r\n       1234 | {fizz,blah}\r\n       1235 | {fooz}\r\n       1235 | {fooz,qaax}\r\n       1235 | {fooz,qaax,quuz}\r\n<\/pre>\n<p> Ah, now you can see that the first value in each array is the correct one.  We just need to pop off that first value and add <code>DISTINCT<\/code> so that we only have one record per collection: <\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nSELECT DISTINCT collection, \r\n       (\r\n\tarray_agg(value) \r\n\tOVER (PARTITION BY collection \r\n\t      ORDER BY book DESC, chapter DESC, line DESC)\r\n\t)&#x5B;1] \r\nFROM values;\r\n<\/pre>\n<p> Notice two things here: <\/p>\n<ul class=\"org-ul\">\n<li>Unlike every other array\/list\/tuple\/collection syntax known to programmer-kind, Postgresql arrays start at 1, not 0.  So the first value in the array is 1.\n<\/li>\n<li>We need to put the whole window expression in parenthesis and add the subscript to that, because it&#8217;s the whole expression (not just <code>array_agg(value)<\/code>) that evaluates to the array.\n<\/li>\n<\/ul>\n<p> This query gives us the correct output, and is considerably more approachable than the fractal approach.  Not only that, but adding more <code>ORDER BY<\/code> columns is trivial. <\/p>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"outline-container-sec-3\" class=\"outline-2\">\n<h2 id=\"sec-3\">A better way?<\/h2>\n<div class=\"outline-text-2\" id=\"text-3\">\n<p> Is there a better way to approach a query like this?  If you know of one, please comment.  It&#8217;d be great to find something as elegant that didn&#8217;t rely on Postgresql-specific features like arrays. <\/p>\n<p> Well, dear reader, I hope you aren&#8217;t dealing with the sort of hare-brained database designs that I have to deal with; but if you are, maybe they&#8217;re a little more manageable now.  Cheers! <\/p>\n<\/div>\n<\/div>\n<div id=\"footnotes\">\n<h2 class=\"footnotes\">Footnotes: <\/h2>\n<div id=\"text-footnotes\">\n<div class=\"footdef\"><sup><a id=\"fn.1\" name=\"fn.1\" class=\"footnum\" href=\"#fnr.1\">1<\/a><\/sup> <\/p>\n<p>See <a href=\"http:\/\/www.postgresql.org\/docs\/9.1\/static\/tutorial-window.html\">http:\/\/www.postgresql.org\/docs\/9.1\/static\/tutorial-window.html<\/a><\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Today I figured out a trick in Postgresql to help me deal with tables whose records are versioned by multiple columns.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[55,57,28,83,81],"class_list":["post-1031","post","type-post","status-publish","format-standard","hentry","category-technology","tag-database","tag-postgresql","tag-programming-2","tag-sql","tag-til"],"_links":{"self":[{"href":"https:\/\/alandmoore.com\/blog\/wp-json\/wp\/v2\/posts\/1031","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/alandmoore.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/alandmoore.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/alandmoore.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/alandmoore.com\/blog\/wp-json\/wp\/v2\/comments?post=1031"}],"version-history":[{"count":13,"href":"https:\/\/alandmoore.com\/blog\/wp-json\/wp\/v2\/posts\/1031\/revisions"}],"predecessor-version":[{"id":1059,"href":"https:\/\/alandmoore.com\/blog\/wp-json\/wp\/v2\/posts\/1031\/revisions\/1059"}],"wp:attachment":[{"href":"https:\/\/alandmoore.com\/blog\/wp-json\/wp\/v2\/media?parent=1031"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/alandmoore.com\/blog\/wp-json\/wp\/v2\/categories?post=1031"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/alandmoore.com\/blog\/wp-json\/wp\/v2\/tags?post=1031"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}