# OpenCL Sorting

Eric Bainville - June 2011## Parallel merge, local

Here again, we sort in local memory first. We execute N threads in workgroups of WG threads, and each workgroup sorts a segment of WG input records. The output contains N/WG ordered sequences.

In merge sort, the main step of the algorithm is to merge together two sorted sequences. Suppose
a_{0},a_{1},...,a_{p-1} and b_{0},b_{1},...,b_{p-1} are two sorted sequences. The position of
a_{i} in the output is i+f(b,a_{i}), where f(seq,x) is the number of values in sequence seq
that are less than x. Since b is sorted,
this number can be obtained by dichotomic search in log_{2}(p) iterations. The same goes for
the position of b_{i} in the output. This is implemented in the following kernel, each thread
executes its own full dichotomic search in the sibling sequence:

__kernel void ParallelMerge_Local(__global const data_t * in,__global data_t * out,__local data_t * aux) { int i = get_local_id(0); // index in workgroup int wg = get_local_size(0); // workgroup size = block size, power of 2 // Move IN, OUT to block start int offset = get_group_id(0) * wg; in += offset; out += offset; // Load block in AUX[WG] aux[i] = in[i]; barrier(CLK_LOCAL_MEM_FENCE); // make sure AUX is entirely up to date // Now we will merge sub-sequences of length 1,2,...,WG/2 for (int length=1;length<wg;length<<=1) { data_t iData = aux[i]; uint iKey = getKey(iData); int ii = i & (length-1); // index in our sequence in 0..length-1 int sibling = (i - ii) ^ length; // beginning of the sibling sequence int pos = 0; for (int inc=length;inc>0;inc>>=1) // increment for dichotomic search { int j = sibling+pos+inc-1; uint jKey = getKey(aux[j]); bool smaller = (jKey < iKey) || ( jKey == iKey && j < i ); pos += (smaller)?inc:0; pos = min(pos,length); } int bits = 2*length-1; // mask for destination int dest = ((ii + pos) & bits) | (i & ~bits); // destination index in merged sequence barrier(CLK_LOCAL_MEM_FENCE); aux[dest] = iData; barrier(CLK_LOCAL_MEM_FENCE); } // Write output out[i] = aux[i]; }

ParallelMerge_Local | |
---|---|

WG | Top speed |

1 | 178 |

2 | 150 |

4 | 163 |

8 | 203 |

16 | 280 |

32 | 406 |

64 | 616 |

128 | 486 |

256 | 383 |

The code is a little tricky to write correctly :-). Each thread performs O(1) global memory accesses, and O(log^{2}(WG)) local
memory accesses. We can observe this O(log^{2}(WG)) behaviour for the largest values.

OpenCL Sorting : Parallel bitonic II | Top of Page | OpenCL FFT |