2015-06-10

Not a new topic, but one I've been mulling over recently.

Let's start with a basic sort operation in Perl over a series of `DateTime` objects:

my @sorted_dates = sort { DateTime->compare( $a, $b ) } @dates;

Perl 5.6 and earlier used quicksort, and after that used mergesort[1]. There's also a sort pragma[2] if you want to specify something else.

Now, quicksort has an average runtime of O(n * log(n) ), but there's a few terrible cases where it goes quadratic--O(n2). Quadratic is bad; the only good thing about quadratic is that it's not polynomial (that is, O(2n), where the exponent is variable rather than the base, which makes us start using phrases like "assume we have until the end of the Universe to finish this run . . . "). The one nice thing is that quicksort has a few friendly cases where it's O(n). With mergesort, almost everything is O(n * log(n) )--that is, giving up a few nice cases in exchange for never going quadratic.

What really is all this Big-Oh stuff? One way to think about it is that it's the number of times you loop over something. It doesn't tell you how long each iteration of the loop will take, just on how many times it will iterate. If you're walking a list, that's O(n), where *n* is the number of elements of a list. If you walk it twice, that's O(2n). For an algorithm that runs in O(n * log(n)), things will take longer than O(n) when n > 10 (because log(10) == 1).

(This is why a hash lookup is sometimes slower than a balanced binary tree. Hash lookups are constant time, O(1) (well, it is if we squint a little), and balanced binary trees are O(log(n)) worst case. But that constant time in hashes is very big, and the tree's loop is very tight. It doesn't take many entries for hashes to catch up, though.)

So when we look at the sort above, we notice that we'll want to keep expensive operations out of the subroutine we're passing to sort. That subroutine will be called O(n * log(n)) times (roughly speaking), so let's try to hoist expensive operations out of it.

This is exactly what the Schwartzian Transform does. Let's start by having a `map()` execute before the sort to pull out some easily sortable data from our DateTime objects. The Unix epoch will do nicely for this (use hires epoch if you need to):

my @sort_friendly_dates = map {[ $_->epoch, $_ ]} @dates;

Now we can modify our sort algorithm to pull out that first entry in the arrayref and do a straight numeric comparison:

my @sorted_dates = sort { $a->[0] <=> $b->[0] } @sort_friendly_dates;

Perl's spaceship (`<=>`) operator does a numeric sort, which will be faster than the lexicographical sorting of the `cmp` operator (which gets even more complicated if you're Tom Christiansen[3]). If you can sort based on numbers rather than strings, that's always better.

Now we want to get at our original object rather than the arrayrefs we created, so we do another `map()` to pull that out:

my @final_sorted_dates = map { $_->[1] } @sorted_dates;

Now let's combine all that together in a single statement to get the Schwartzian Transform as it's usually seen:

my @sorted_dates = map { $_->[1] } sort { $a->[0] <=> $b->[0] } map {[ $_->epoch, $_ ]} @dates;

This ends up being O(2n) for the two maps, plus O(n * log(n)) for the sort. But we're making that O(n * log(n)) cheaper per iteration, so it's a win overall if `@dates` is big enough.

Can we take this further? If we can collapse that arrayref into a straight string or number, then we can remove the subroutine passed to sort entirely. This means perl will do a lexicographic sort completely within its internals without the extra time taken to call a Perl subroutine (which more than makes up for the extra time of a lexicographic sort rather than straight numeric). This isn't always easy, but in this case we could do it by throwing away our DateTime objects entirely in the first `map()` (first by order of operations, that is), and then rebuilding it with the epoch data passed to the second `map()`:

my @sorted_dates = map { DateTime->from_epoch( epoch => $_ ) } sort map { DateTime->epoch } @dates;

This form is called the "Guttman-Rosler Transform". It may or may not be faster than an equivalent Schwartzian, depending on how expensive destructing/reconstructing the object is. That would also do all kinds of fucky things to runtimes with threaded garbage collectors, but it's not too bad for a refcounting language like Perl5. If you're serializing things into strings, care has to be taken that you'll get the output objects deserialized the same way on the flip side. Big-Oh is the same as the Schwartzian.

One thing that I think is overlooked about the Schwartzian is improving separation of concerns. The sort routine is not responsible for formatting data in a way that's sortable, and there's no need to copy/paste the same code for `$a` and `$b`. That's the sole job of the first `map()`. Once you recognize the idiom, your eyes can easily fall to the place where the data munging is happening. This makes the Schwartzian a nice idiom even when its performance improvement is marginal.