Getting sorted
I am always pleased when one of my colleagues volunteers to write a guest post. Apart from the obvious benefit that I have less work to do, it broadens the range of topics covered. A frequent guest blogger is Brooks Moses, who always has something interesting to say. Today he is musing on sorting algorithms …
In my last guest post, I alluded to an interesting story about the sort routine in our Mentor Embedded Performance Library (aka “MEPL”). MEPL includes a couple of functions for sorting arrays of floating-point numbers, and of course we needed to make those go fast.
It’s traditional for discussions of sort algorithms to start out with a simple algorithm like insertion sort, where you go through the array from left to right, and for each element you move it to the left until it’s in order. Far be it from me to break tradition! This is of course painfully slow for anything above a few elements, because for arbitrary arrays it requires O(n²) comparisons:
- Sorting 16-element array with insertion sort: 830 cycles
- Sorting 256-element array with insertion sort: 104 thousand cycles
- Sorting 32,768-element array with insertion sort: 1.8 billion cycles
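For concreteness, here’s a minimal C sketch of that insertion sort (my own illustration, not the MEPL code):

    /* Walk left to right; slide each element leftward until it's in order. */
    void insertion_sort(float *a, int n)
    {
        for (int i = 1; i < n; i++) {
            float v = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > v) {
                a[j + 1] = a[j];  /* shift larger elements one slot right */
                j--;
            }
            a[j + 1] = v;
        }
    }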
So, we’ll discard that quickly. The best possible comparison sorts are O(n log n), which is much faster. The standard algorithm for sorting things these days is Quicksort, and there are lots of good implementations around. There’s even a version of it in the C standard library (qsort), so let’s use that as a real-world reference to compare against.
- Sorting 16-element array with qsort: 2600 cycles
- Sorting 256-element array with qsort: 80 thousand cycles
- Sorting 32,768-element array with qsort: 16.9 million cycles
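To make those numbers concrete, here’s roughly how qsort gets used for this (a sketch; the comparator name is mine):

    #include <stdlib.h>

    /* qsort calls this, through a function pointer, for every comparison. */
    static int compare_floats(const void *pa, const void *pb)
    {
        float a = *(const float *) pa;
        float b = *(const float *) pb;
        return (a > b) - (a < b);
    }

    /* usage: */
    qsort(array, n, sizeof(float), compare_floats);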
However, qsort is also going to be slow for this application, because it’s designed to be general-purpose and calls back through a function pointer for every comparison. It’s considerably faster to use an implementation that has the comparison “baked in”. A nice place to start is the one in LAPACK, since that’s already designed for floating-point numbers and it’s BSD-licensed. So, that’s where I started with the real implementation.
- Sorting 16-element array with LAPACK’s quicksort: 1200 cycles
- Sorting 256-element array with LAPACK’s quicksort: 36 thousand cycles
- Sorting 32,768-element array with LAPACK’s quicksort: 8.3 million cycles
That’s already a substantial improvement. But we can do better.
(And let’s ignore the tiny 16-element case for a while; I’ll get back to it at the end.)
Most of the time in a sort function is taken up with doing comparisons — about 500,000 of them for our 32,768-element array. On our Power processor, the floating-point comparison unit takes five cycles to complete, and, because each one determines a branch, they can’t be pipelined; we have to wait for each one to complete. By comparison, the integer comparison unit only takes one cycle. What if we could use it instead?
It’s not a completely crazy idea. Floating-point numbers are stored with the exponent in higher-order bits than the mantissa, so positive floating-point numbers will actually compare correctly if we use the integer comparison unit on their bits. The trouble is negative numbers, and there are two problems with them. The first is that a set sign bit means a negative number, but the sign bit is stored as the highest bit, so negative numbers would compare as larger than positive ones. Okay, so we can flip the sign bit and solve that. The second is that floating-point numbers are stored as sign and magnitude (rather than two’s complement), so a negative number with a larger magnitude will compare as larger than a negative number with smaller magnitude, which is backwards. We can fix that one pretty easily too, though; if it’s a negative number, we flip all the bits.
Thus, we can write a couple of simple functions that convert floating-point numbers into unsigned integers in such a way that the integer compare unit will give the “right” answer, and then convert them back afterwards. There are a couple of subtleties; we use a signed-integer right shift to fill the entire mask with copies of the sign bit, and to convert things back we have to remember that the sign bit got flipped when doing the “if it’s a negative number” part:
    /* Map a float's bits to an unsigned integer with the same ordering
       under integer comparison.  The signed right shift smears the sign
       bit across the whole word, so the mask is all-ones for a negative
       input (flip every bit) and just the sign bit for a positive one. */
    inline unsigned int float_to_bits(unsigned int x)
    {
        unsigned int mask = (((int) x) >> 31) | 0x80000000;
        return x ^ mask;
    }

    /* Invert the mapping.  The sign bit got flipped on the way in, so a
       clear top bit now marks an originally negative number; hence the
       complement on the shifted mask. */
    inline unsigned int bits_to_float(unsigned int x)
    {
        unsigned int mask = ~(((int) x) >> 31) | 0x80000000;
        return x ^ mask;
    }
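As a quick sanity check (my own test harness, not from the original implementation; it assumes the two conversion functions above are visible in the same file, e.g. declared static inline), the converted bits order the same way the floats do, and the round trip recovers the original bits. Note the memcpy to get at a float’s bits without violating aliasing rules:

    #include <assert.h>
    #include <string.h>

    static unsigned int as_key(float f)
    {
        unsigned int bits;
        memcpy(&bits, &f, sizeof bits);  /* extract the raw bits */
        return float_to_bits(bits);
    }

    int main(void)
    {
        assert(as_key(-1000.0f) < as_key(-1.0f));  /* big negatives come first */
        assert(as_key(-1.0f) < as_key(0.0f));
        assert(as_key(0.0f) < as_key(2.5f));
        /* round trip: 0xC0200000 is the bit pattern of -2.5f */
        assert(bits_to_float(float_to_bits(0xC0200000u)) == 0xC0200000u);
        return 0;
    }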
And it turns out that using the integer comparison unit saves enough time that it’s actually worth the cost of making an extra pass at the start and end to do the conversions:
- Sorting 256-element array with integer-comparison quicksort: 24 thousand cycles
- Sorting 32,768-element array with integer-comparison quicksort: 5.6 million cycles
The real reason that I was thinking about mapping floating-point numbers to integers with the same ordering wasn’t to do a faster quicksort, though. It’s because there are even faster sort algorithms for integers.
Suppose we had an array of things tagged with 8-bit integers that we wanted to sort. One way to sort them is to just compute where everything goes, rather than comparing them. Iterate through the array and count up the number of things tagged with 0, the number tagged with 1, and so forth. Then, we can compute that the things tagged with 0 should start at index 0, the ones tagged with 1 start immediately after those, and so forth. Then, we make another pass through the array, and actually move things. If an element has tag i, we put it at the start location for tag i, and increment that start location so the next element for tag i will go in the slot after it.
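In C, a single counting pass might look like this (a sketch under my own naming; I’ve parameterized it by a bit shift so the same routine can serve for the radix passes below):

    /* Stable counting sort of n keys by the 8-bit tag at the given shift.
       Writes the reordered keys into dst; src is left untouched. */
    void counting_pass(const unsigned int *src, unsigned int *dst,
                       int n, unsigned int shift)
    {
        int count[256] = { 0 };
        int start[256];

        /* First pass: tally how many elements carry each tag. */
        for (int i = 0; i < n; i++)
            count[(src[i] >> shift) & 0xFF]++;

        /* Turn the tallies into starting positions for each tag. */
        start[0] = 0;
        for (int t = 1; t < 256; t++)
            start[t] = start[t - 1] + count[t - 1];

        /* Second pass: move each element to its slot, bumping the index
           so elements with equal tags keep their original order. */
        for (int i = 0; i < n; i++)
            dst[start[(src[i] >> shift) & 0xFF]++] = src[i];
    }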
This has a couple of nice properties. The first is that it takes O(n) operations. (As I mentioned, you can’t do a comparison sort in fewer than O(n log n) comparisons, but we’ve found a loophole because we’re not comparing elements!) The second is that it’s a “stable” sort; when elements have the same tag, they’ll end up in the same order that they started in.
That works nicely for 8-bit integers, but it’s not very practical even for 16-bit ones; we’d have to keep 2¹⁶ tallies to figure out where everything goes. However, there’s another clever trick! Let’s sort all of the 16-bit integers by their lowest 8 bits first. Then, because it’s a stable sort, if we then sort them by their highest 8 bits, they will end up completely in order — all of the elements that have the same highest 8 bits are still in order by their lowest 8 bits after this second pass, because they were in that order before the pass and we didn’t reorder them.
The same idea works for larger integers, too; for 32-bit integers, we could do four passes sorting on sets of 8 bits from lowest to highest. (It turns out that it’s actually a little nicer to do the passes on sets of 10 or 11 bits, so you only need three passes instead of four, and the number of tallies isn’t too large.)
This idea is called a radix sort.
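Using the counting pass sketched above, a radix sort on 32-bit keys then comes out like this (again a sketch, shown with four 8-bit passes rather than the three 11-bit ones; tmp is scratch space the same size as the array):

    /* LSD radix sort: four stable passes, lowest byte first.  Each pass
       ping-pongs between the array and the scratch buffer, so after the
       fourth pass the fully sorted result is back in a. */
    void radix_sort(unsigned int *a, unsigned int *tmp, int n)
    {
        counting_pass(a, tmp, n, 0);
        counting_pass(tmp, a, n, 8);
        counting_pass(a, tmp, n, 16);
        counting_pass(tmp, a, n, 24);
    }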
- Sorting 256-element array with radix sort: 40 thousand cycles
- Sorting 32,768-element array with radix sort: 1.8 million cycles
That’s a very nice speedup on the larger array. But what about the smaller one? Setting up all those tallies takes a lot of time, and so it’s actually slower than even the basic LAPACK quicksort. The conventional wisdom is that you shouldn’t use radix sorts for small arrays. Which is true, for a pure radix sort.
On the other hand, buried in a dusty corner of Knuth’s The Art of Computer Programming is another idea. Let’s go back to sorting with 8-bit passes; each one is faster than an 11-bit pass. What happens if we just do a couple of 8-bit passes, to sort the values by their highest 16 bits? Since there are only 256 elements, it’s unlikely that very many of them will have the same highest 16 bits, unless they’re exactly equal or we have a really unusual data set. So, just sorting by the highest 16 bits should give us something that’s almost in order.
And remember that insertion sort we started with? It’s O(n²) for arbitrary arrays, yes — but for arrays that are almost in order, it only takes O(n) operations because it doesn’t have to move things around very much. So, if we follow our radix sort on the upper 16 bits with an insertion sort, we get a completely ordered array, and we still get it in O(n) time for mid-sized arrays:
- Sorting 256-element array with partial-radix/insertion sort: 15 thousand cycles
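In terms of the earlier sketches, the hybrid amounts to this (assuming counting_pass from above, plus a hypothetical insertion_sort_uint over the converted unsigned keys, analogous to the float version at the top):

    /* Partial radix sort on the highest 16 bits, then an insertion sort
       to finish the nearly-ordered array in roughly O(n) time. */
    void hybrid_sort(unsigned int *a, unsigned int *tmp, int n)
    {
        counting_pass(a, tmp, n, 16);   /* bits 16-23 */
        counting_pass(tmp, a, n, 24);   /* bits 24-31; result back in a */
        insertion_sort_uint(a, n);      /* hypothetical unsigned variant */
    }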
Thus, what I actually put in MEPL was a compound algorithm that uses a pure radix sort for larger sizes, a combined radix sort and insertion sort on middle sizes, and an insertion sort on the very smallest sizes where its simplicity means it is actually fastest. Combine that with the usual sorts of tweaking to make the implementation faster, as I talked about in the previous post, and here are the final numbers (which really do have pretty much an O(n) scaling; you can check!):
- Sorting 16-element array with optimized MEPL sort: 820 cycles
- Sorting 256-element array with optimized MEPL sort: 13 thousand cycles
- Sorting 32,768-element array with optimized MEPL sort: 1.3 million cycles
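Structurally, the final routine is a dispatch on array size along these lines (the crossover constants below are placeholders of my own, not the values actually used in MEPL):

    /* Hypothetical crossover points; the post doesn't give the real ones. */
    #define SMALL_CUTOFF 32
    #define MID_CUTOFF   4096

    /* Pick the algorithm by array size. */
    void mepl_style_sort(unsigned int *a, unsigned int *tmp, int n)
    {
        if (n <= SMALL_CUTOFF)
            insertion_sort_uint(a, n);   /* simplicity wins when tiny */
        else if (n <= MID_CUTOFF)
            hybrid_sort(a, tmp, n);      /* partial radix + insertion */
        else
            radix_sort(a, tmp, n);       /* pure radix for big arrays */
    }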
Compare that to the numbers for the basic LAPACK quicksort, and it’s clear that just tweaking the implementation wouldn’t have gotten anything close to these results; the LAPACK implementation is already fairly well-optimized. It really pays to think about the algorithm first!