Selecting rows holding group-wise maximum is a favorite problem of mine, but one which only rarely pops up. But for some reason, after my last blog post on the subject, it seems to be mentioned almost daily around here.
Something that I forgot to mention in the previous post is that most of the examples there assume suitable indexing is available to get decent performance. Basically a composite index on both the column(s) in the GROUP BY and the column over which MAX is computed is needed. In the example I gave, such an index is available throught the primary key.
However, such an index may not be available in all cases. Maybe maintaining it would be too expensive, or maybe the data the max is computed over is itself the result of a (sub-)query, and no indexing is available. So it is worth it also to understand this case, as the performance of the different possible queries differ greatly from the indexed case.
So let us modify the original example to not have any useful indexes:
CREATE TABLE object_versions ( id INT PRIMARY KEY AUTO_INCREMENT, object_id INT NOT NULL, version INT NOT NULL, data VARCHAR(1000) ) ENGINE=InnoDB;
This time, I will use a data set of size only 1% of the previous example, as without indexes some of the queries get ridiculously poor performance. So let us take 10,000 rows, 1000 object each with 10 versions. I use this Perl long^H^H^H^Hone-liner to load the data:
perl -MDBI -le '$vers=10; $groups=1000; $dbh=DBI->connect("dbi:mysql:", "test", "pass", {RaiseError => 1}); $dbh->do("USE test"); foreach $o (1..$groups) { $dbh->do("INSERT INTO object_versions VALUES " . join(", ", map("(null, ?,?,?)", 1..$vers)), undef, map( ($o, $_, "data_${o}_$_"), 1..$vers)); }'
(Yes, I know… but I have a strange love for Perl one-liners).
Here are the results:
mysql> SELECT data FROM object_versions o1 WHERE version = (SELECT MAX(version) FROM object_versions o2 WHERE o1.object_id = o2.object_id); 1000 rows in set (25.55 sec) mysql> SELECT o1.data FROM object_versions o1 INNER JOIN (SELECT object_id, MAX(version) AS version FROM object_versions GROUP BY object_id) o2 ON (o1.object_id = o2.object_id AND o1.version = o2.version); 1000 rows in set (0.72 sec) mysql> SELECT o1.data FROM object_versions o1 LEFT JOIN object_versions o2 ON o1.object_id = o2.object_id AND o1.version < o2.version WHERE o2.object_id IS NULL; 1000 rows in set (15.44 sec) mysql> SELECT MAX(CONCAT(version, ":", data)) FROM object_versions GROUP BY object_id; 1000 rows in set (0.52 sec) mysql> SELECT data FROM (SELECT * FROM object_versions ORDER BY object_id DESC, version DESC) t GROUP BY object_id; 1000 rows in set (0.04 sec)
The only query that has any kind of decent performance here is last one using the “evil trick” of abusing the MySQL GROUP BY extensions in a way that is explicitly documented to not produce well-defined results. Which is really sad, since it is the only way I know of of getting the database to make the obvious execution plan in this case, which is to simply sort the data on the GROUP BY expression, and then loop over the rows picking the max row in each group on the way.
In fact, in many cases I think a decent alternative is to just select all rows into the client using ORDER BY, and do the aggregation there.
Now I just need someone to implement my SELECT MAX(object_id, version)
…