Now we’re going to take a completely different approach to sorting. Generally, most sorting algorithms are data dependent. Based on the values of the data, we choose to do different things. And if we sort 2 different sequences, we’d probably take a different path through the code. Instead, now we’re going to consider a set of sorting algorithms that we call oblivious. No matter what input we choose, the sorting algorithm proceeds the exact same way. In fact, the only data dependence we will have in the whole algorithm is a swap operation that inputs 2 elements and outputs them in the correct order. Let me take a brief digression. Why is an oblivious algorithm a good algorithm for a parallel processor like a GPU? So when I talk about an oblivious algorithm, what I mean is that its behavior is independent of some particular aspect of the problem. In this case we’re talking about a sorting algorithm that always does the exact same steps no matter what the input is. Good CPU sorting algorithms are generally more clever. They have more complex control flow and a lot of data-dependent decisions to run fast. GPUs aren’t so great at complex control flow. Instead, they’re great at simple control flow and massive parallelism, and oblivious algorithms are generally a good match for massively parallel approaches to problems. Okay. Let’s return to sorting. Clearly, an example will help show what I mean. This structure that I’m showing here is called a sorting network. The input is placed on the lines at the left. So this will input 4 elements–1, 2, 3, 4. Each time 2 values are on either side of a shared vertical line, these values are swapped if they are in the wrong order. So let’s put some numbers on there and give it a shot. So we’re going to start with the input sequence 4, 2, 1, 3. And so each time 2 elements are connected by 1 green line, we will swap them if they are out of order. So first we’ll swap 2 and 4. But we won’t swap 1 and 3 because they’re in the right order. Now we will look at 2 and 3, and we don’t have to swap them but we do have to swap 1 and 4. And finally we have 2 more swaps to do. 1 and 2 are in the wrong order, so we’ll swap them. 4 and 3 are also in the wrong order, so we will swap them. And, voila, now we’ve moved from an unsorted sequence to a sorted sequence. Fortunately, this bitonic sorting network scales in a straightforward and easily programmable way. So we had a little sorting network that would sort 4 items. It’s fairly straightforward to expand it so that now it can sort 8. So let me try to give you a little intuition about how that works. So a bitonic sequence is a sequence that only changes direction a maximum of once. So if we look at this sequence here, we’re going up for a little while and then down for a little while, but we only change direction right here. So this is bitonic. How about this sequence here? Well, we’re going down and then up. 753 goes down, then we change direction and go back up. So sort of the trace of the sequence looks like that and that is also bitonic. But we might have a sequence that looks like this–1324– where we go up and then down and then up again. This, however, is not bitonic. Now why do we care? It turns out that it’s particularly easy to sort a bitonic sequence, and let me tell you how. So let’s say we have this bitonic sequence or, alternatively, 2 monotonic sequences that we can turn into 1 bitonic sequence. Here is what we are going to do. We’re going to take the first half of this bitonic sequence and we’re going to lay it on top of the second half of this bitonic sequence. Then what we’re going to do is do a pairwise comparison of each element here. And we’re going to take the larger element and we’re going to put it in this set here, and we’re going to take the smaller element and we’re going to put it in this size here. So what we’ve done is we’ve partitioned this bitonic sequence into 2 bitonic sequences, one of which is bigger and one of which is smaller. And so then we can recurse and do the bitonic sort on each one of these subsequences and continue until we have a completely sorted sequence. The overall algorithm is generally to sort 2 sequences, we reverse 1, we append it to the other to create a bitonic sequence, we split that bitonic sequence into the larger and smaller halves, and then we sort each half. So if we look at this big picture here, these 2 boxes here sort to input sequences of size 4, this segment here splits 2 sorted sequences into a larger half and a smaller half, and then these 2 boxes here will sort 2 bitonic sequences that result for the smaller half and the bigger half. So if we actually did the analysis here, we would find that for an input set of size n it requires something proportional to log n stages overall. If we actually looked here, we would find that’s 1, 2, 3, 4, 5, 6. And the first stage compares and swaps all elements once, the next stage does 2 swaps, and so on. And so the total complexity here is that we have n(log^n) swaps overall. Well, that’s all nice theory, but here’s the best part. What should immediately draw your eye is that within any set of swaps every swap is completely in parallel. So if you look at this stage here, for instance, we have 4 swaps but each of them can proceed in parallel. We have 4 swaps in this particular stage. Each of those can proceed in parallel. Here’s a different permutation of swaps, and again these can all be pursued in parallel. And this obviously makes a GPU programmer very, very happy. Now that we know how this sorting network works, let’s think about its performance when given different inputs. For each of the following possibilities with the same number of elements, what are the comparative runtimes in units of time? So a completely sorted sequence, how long does it take, an almost sorted sequence–just about sorted, maybe a couple elements off– a completely reversed sequence, or a completely random sequence? So here this would indicate that A would be the fastest, then B, then D, then C. So our choices are A, B, D, C; C is fastest, then D, B, A; A is fastest, then B, C, D; or they’ll all take the same amount of time.

hay bro what u use to make video…

very nice video…

Good video, i understood very step, i will program this algorithm in CUDA, sorry my bad English

Great video. Explains it very clearly.

Doesn't this still require control flow changes? Gpus might be parallel power houses but if the codepaths diverge often it will slow it down by orders of magnitude. And of course it would be a synchronization nightmare on multicore cpus…

Assuming that swapping two values takes some additional time, I'd say the answer is C<D <B<A. The number of comparisons is the same in all cases, but with C we have to swap basically everything and with A we don't have to swap anything at all. Let me know if I'm right 🙂

Do you mean that sorting networks can't sort a non bitonic sequence?

Awesome, learnt a lot about sorting networks from this paper.