OpenCL Sorting

Eric Bainville - June 2011

Parallel bitonic

The simplicity and regularity of the bitonic sort make it an ideal candidate for experiments. Let's transform the inner loop of the ParallelBitonic_Local kernel into a kernel operating on global memory.

__kernel void ParallelBitonic_A(__global const data_t * in,__global data_t * out,int inc,int dir)
{
  int i = get_global_id(0); // thread index
  int j = i ^ inc; // sibling to compare

  // Load values at I and J
  data_t iData = in[i];
  uint iKey = getKey(iData);
  data_t jData = in[j];
  uint jKey = getKey(jData);

  // Compare
  bool smaller = (jKey < iKey) || ( jKey == iKey && j < i );
  bool swap = smaller ^ (j < i) ^ ((dir & i) != 0);

  // Store
  out[i] = (swap)?jData:iData;
}

The loop on length and inc is now in the host code: we call O(log2(N)²) kernels, swapping the input and output buffers and enqueuing a barrier at each invocation.
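
For illustration, the host loop could look like the following sketch (the names — queue, kernel, bufA, bufB — and the absence of error handling are mine, not the article's; the dir argument is assumed to be the mask length<<1, matching the dir & i test in the kernel):

#include <CL/cl.h>

// Hypothetical host-side driver for ParallelBitonic_A (a sketch, not the
// article's code: no error checking; queue, kernel and buffers are assumed
// to be created elsewhere, and n is a power of two).
void sort_bitonic_A(cl_command_queue queue, cl_kernel kernel,
                    cl_mem bufA, cl_mem bufB, int n)
{
  cl_mem src = bufA, dst = bufB;
  for (int length = 1; length < n; length <<= 1)   // bitonic block size
    for (int inc = length; inc > 0; inc >>= 1)     // comparison distance
    {
      int dir = length << 1; // assumed asc/desc mask (see the dir & i test in the kernel)
      clSetKernelArg(kernel, 0, sizeof(cl_mem), &src);
      clSetKernelArg(kernel, 1, sizeof(cl_mem), &dst);
      clSetKernelArg(kernel, 2, sizeof(int), &inc);
      clSetKernelArg(kernel, 3, sizeof(int), &dir);
      size_t global = (size_t)n; // one thread per value
      clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
      clEnqueueBarrier(queue); // order this pass before the next one
      cl_mem tmp = src; src = dst; dst = tmp; // swap input/output buffers
    }
  // The sorted data ends up in src (the destination of the last pass).
}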

ParallelBitonic_A

log2(N)   Key+Value
   9         1.1
  10         1.9
  13        10
  14        17
  17        37
  18        37
  21        30
  22        28
  25        21

Performance of the ParallelBitonic_A kernel, Mkey/s.

Each thread makes 3 global memory accesses, and we execute L*(L+1)/2 kernels, where L=log2(N). For L=22 this means 253 kernels; at 28 Mkey/s, with three 8-byte accesses per key+value record in each pass, we reach an impressive 169 GB/s of global memory I/O rate.

In the above kernel, since we run one thread per value, we perform each comparison between i and j twice: once in thread i, and once in thread j. We can make this comparison in a single thread managing both values: two reads, two writes, and one comparison instead of four reads, two writes, and two comparisons. This leads to the following kernel, requiring N/2 threads at each pass, and modifying the data in place:

// Compare-exchange: orders the pair (a,b) in the direction selected by the
// `reverse` flag defined in the calling kernel
#define ORDER(a,b) { bool swap = reverse ^ (getKey(a)<getKey(b)); \
  data_t auxa = a; data_t auxb = b; a = (swap)?auxb:auxa; b = (swap)?auxa:auxb; }

__kernel void ParallelBitonic_B2(__global data_t * data,int inc,int dir)
{
  int t = get_global_id(0); // thread index
  int low = t & (inc - 1); // low order bits (below INC)
  int i = (t<<1) - low; // insert 0 at position INC (e.g. inc=4, t=5: low=1, i=9, pair (9,13))
  bool reverse = ((dir & i) == 0); // asc/desc order
  data += i; // translate to first value

  // Load
  data_t x0 = data[  0];
  data_t x1 = data[inc];

  // Sort
  ORDER(x0,x1)

  // Store
  data[0  ] = x0;
  data[inc] = x1;
}

This kernel runs at 49 Mkey/s for Key+Value, exceeding 160 GB/s of global memory I/O rate, and at 73 Mkey/s for Key only. The speed is now limited by the large number of kernel invocations: for N=2^20, for example, we run L*(L+1)/2 = 210 kernels.

We can collapse two consecutive calls of the kernel, with distances inc and inc/2, into a single one. Each thread then processes 4 values in place, performing 4 comparisons. This is illustrated in the following kernel:

__kernel void ParallelBitonic_B4(__global data_t * data,int inc,int dir)
{
  inc >>= 1; // this kernel handles distances INC and INC/2; INC now holds the smaller one
  int t = get_global_id(0); // thread index
  int low = t & (inc - 1); // low order bits (below INC)
  int i = ((t - low) << 2) + low; // insert 00 at position INC
  bool reverse = ((dir & i) == 0); // asc/desc order
  data += i; // translate to first value

  // Load
  data_t x0 = data[    0];
  data_t x1 = data[  inc];
  data_t x2 = data[2*inc];
  data_t x3 = data[3*inc];

  // Sort
  ORDER(x0,x2)
  ORDER(x1,x3)
  ORDER(x0,x1)
  ORDER(x2,x3)

  // Store
  data[    0] = x0;
  data[  inc] = x1;
  data[2*inc] = x2;
  data[3*inc] = x3;
}

Using the B2 and B4 kernels, parallel bitonic sort reaches 78 Mkey/s for Key+Value, and 115 Mkey/s for Key only. Let's continue in this direction with kernels processing 8 and 16 inputs at a time:

// Compare-exchange of x[a] and x[b] (uses the `reverse` flag of the caller)
#define ORDERV(x,a,b) { bool swap = reverse ^ (getKey(x[a])<getKey(x[b])); \
      data_t auxa = x[a]; data_t auxb = x[b]; \
      x[a] = (swap)?auxb:auxa; x[b] = (swap)?auxa:auxb; }
// Bitonic merge networks on 2, 4, 8 and 16 values starting at index a
#define B2V(x,a) { ORDERV(x,a,a+1) }
#define B4V(x,a) { for (int i4=0;i4<2;i4++) { ORDERV(x,a+i4,a+i4+2) } B2V(x,a) B2V(x,a+2) }
#define B8V(x,a) { for (int i8=0;i8<4;i8++) { ORDERV(x,a+i8,a+i8+4) } B4V(x,a) B4V(x,a+4) }
#define B16V(x,a) { for (int i16=0;i16<8;i16++) { ORDERV(x,a+i16,a+i16+8) } B8V(x,a) B8V(x,a+8) }

__kernel void ParallelBitonic_B8(__global data_t * data,int inc,int dir)
{
  inc >>= 2; // this kernel handles distances INC, INC/2 and INC/4; INC now holds the smallest one
  int t = get_global_id(0); // thread index
  int low = t & (inc - 1); // low order bits (below INC)
  int i = ((t - low) << 3) + low; // insert 000 at position INC
  bool reverse = ((dir & i) == 0); // asc/desc order
  data += i; // translate to first value

  // Load
  data_t x[8];
  for (int k=0;k<8;k++) x[k] = data[k*inc];

  // Sort
  B8V(x,0)

  // Store
  for (int k=0;k<8;k++) data[k*inc] = x[k];
}

__kernel void ParallelBitonic_B16(__global data_t * data,int inc,int dir)
{
  inc >>= 3; // this kernel handles distances INC down to INC/8; INC now holds the smallest one
  int t = get_global_id(0); // thread index
  int low = t & (inc - 1); // low order bits (below INC)
  int i = ((t - low) << 4) + low; // insert 0000 at position INC
  bool reverse = ((dir & i) == 0); // asc/desc order
  data += i; // translate to first value

  // Load
  data_t x[16];
  for (int k=0;k<16;k++) x[k] = data[k*inc];

  // Sort
  B16V(x,0)

  // Store
  for (int k=0;k<16;k++) data[k*inc] = x[k];
}
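
The host then picks, at each step, the widest kernel compatible with the number of remaining stages. Here is a hypothetical sketch of that dispatch (the function name, the kernels array, and the allow flags are mine, not the article's):

// Hypothetical host-side dispatch (a sketch: no error handling; kernels[0..3]
// are assumed to hold the compiled B2, B4, B8 and B16 kernels). At each step
// we greedily collapse as many consecutive stages as possible.
void sort_bitonic_B(cl_command_queue queue, cl_kernel kernels[4],
                    cl_mem buf, int n, int allowB4, int allowB8, int allowB16)
{
  for (int length = 1; length < n; length <<= 1)
    for (int inc = length; inc > 0; )
    {
      int ninc = 1;                             // default: B2, one stage
      if (allowB16 && inc >= 8) ninc = 4;       // B16 collapses 4 stages
      else if (allowB8 && inc >= 4) ninc = 3;   // B8 collapses 3 stages
      else if (allowB4 && inc >= 2) ninc = 2;   // B4 collapses 2 stages
      cl_kernel k = kernels[ninc - 1];
      int dir = length << 1;                    // asc/desc mask, as before
      clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
      clSetKernelArg(k, 1, sizeof(int), &inc);
      clSetKernelArg(k, 2, sizeof(int), &dir);
      size_t global = (size_t)n >> ninc;        // each thread handles 2^ninc values
      clEnqueueNDRangeKernel(queue, k, 1, NULL, &global, NULL, 0, NULL, NULL);
      clEnqueueBarrier(queue);
      inc >>= ninc;
    }
}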

The use of loops does not slow down the generated code. Using the B2, B4, and B8 kernels, we can reach 103 Mkey/s for Key+Value, and 160 Mkey/s for Key only. Unfortunately, a probable bug in the driver (Catalyst 11.5 and 11.6) prevents the B16 kernel from executing without errors in the output. It runs correctly on an NVIDIA GPU.

ParallelBitonic

Kernels          Key+Value   Key
B2                   49       73
B2+B4                78      115
B2+B4+B8            103      160
B2+B4+B8+B16       FAIL     FAIL

Performance of the combined ParallelBitonic_B kernels, Mkey/s.

When inc becomes small enough, all compared values fall inside the range of indices handled by a single workgroup. It is then possible to terminate the inc loop inside a single kernel, inserting barriers to synchronize the threads. The second step will be to replace global memory I/O with local memory I/O for the intermediate stages. We will implement this in the next page: Parallel bitonic II.
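
As a preview of the first step only, here is a minimal, hypothetical sketch reusing the ORDER macro (the kernel name is mine; the real kernels are on the next page). It assumes inc0 does not exceed the workgroup size, so that every pair compared by a thread stays inside the block of values covered by its workgroup, and a barrier is enough to order the stages:

// Hypothetical sketch: finish the remaining stages inc0, inc0/2, ..., 1 in a
// single kernel. Valid only when inc0 <= workgroup size, so all compared
// pairs stay inside the block covered by one workgroup.
__kernel void ParallelBitonic_C2_sketch(__global data_t * data,int inc0,int dir)
{
  int t = get_global_id(0);
  for (int inc=inc0;inc>0;inc>>=1)
  {
    int low = t & (inc - 1);
    int i = (t<<1) - low; // same indexing as ParallelBitonic_B2
    bool reverse = ((dir & i) == 0);
    data_t x0 = data[i];
    data_t x1 = data[i+inc];
    ORDER(x0,x1)
    data[i    ] = x0;
    data[i+inc] = x1;
    barrier(CLK_GLOBAL_MEM_FENCE); // complete the whole stage before starting the next
  }
}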