OpenCL Sorting
Eric Bainville - June 2011Parallel merge, local
Here again, we sort in local memory first. We execute N threads in workgroups of WG threads, and each workgroup sorts a segment of WG input records. The output contains N/WG ordered sequences.
In merge sort, the main step of the algorithm is to merge together two sorted sequences. Suppose a0,a1,...,ap-1 and b0,b1,...,bp-1 are two sorted sequences. The position of ai in the output is i+f(b,ai), where f(seq,x) is the number of values in sequence seq that are less than x. Since b is sorted, this number can be obtained by dichotomic search in log2(p) iterations. The same goes for the position of bi in the output. This is implemented in the following kernel, each thread executes its own full dichotomic search in the sibling sequence:
__kernel void ParallelMerge_Local(__global const data_t * in,__global data_t * out,__local data_t * aux) { int i = get_local_id(0); // index in workgroup int wg = get_local_size(0); // workgroup size = block size, power of 2 // Move IN, OUT to block start int offset = get_group_id(0) * wg; in += offset; out += offset; // Load block in AUX[WG] aux[i] = in[i]; barrier(CLK_LOCAL_MEM_FENCE); // make sure AUX is entirely up to date // Now we will merge sub-sequences of length 1,2,...,WG/2 for (int length=1;length<wg;length<<=1) { data_t iData = aux[i]; uint iKey = getKey(iData); int ii = i & (length-1); // index in our sequence in 0..length-1 int sibling = (i - ii) ^ length; // beginning of the sibling sequence int pos = 0; for (int inc=length;inc>0;inc>>=1) // increment for dichotomic search { int j = sibling+pos+inc-1; uint jKey = getKey(aux[j]); bool smaller = (jKey < iKey) || ( jKey == iKey && j < i ); pos += (smaller)?inc:0; pos = min(pos,length); } int bits = 2*length-1; // mask for destination int dest = ((ii + pos) & bits) | (i & ~bits); // destination index in merged sequence barrier(CLK_LOCAL_MEM_FENCE); aux[dest] = iData; barrier(CLK_LOCAL_MEM_FENCE); } // Write output out[i] = aux[i]; }
ParallelMerge_Local | |
---|---|
WG | Top speed |
1 | 178 |
2 | 150 |
4 | 163 |
8 | 203 |
16 | 280 |
32 | 406 |
64 | 616 |
128 | 486 |
256 | 383 |
The code is a little tricky to write correctly :-). Each thread performs O(1) global memory accesses, and O(log2(WG)) local memory accesses. We can observe this O(log2(WG)) behaviour for the largest values.
OpenCL Sorting : Parallel bitonic II | Top of Page | OpenCL FFT |