OpenCL: Performance of atomics

Wednesday, April 10, 2013

Performance of atomics

Atomics in OpenCL are very useful, but if they are not used carefully, severe performance penalties can appear. Let's create simple OpenCL kernel which does sum of ones utilizing atomics:

kernel void AtomicSum(global int* sum){
    atomic_add(sum,1);
}

Let's try to test this kernel running 1024x1024x128 threads:

int sum=0;
cl::Buffer bufferSum = cl::Buffer(context, CL_MEM_READ_WRITE, 1 * sizeof(float));
queue.enqueueWriteBuffer(bufferSum, CL_TRUE, 0, 1 * sizeof(int), &sum);
cl::Kernel kernel=cl::Kernel(program, "AtomicSum");
kernel.setArg(0,bufferSum);
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(1024*1024*128), cl::NullRange);
queue.finish();

queue.enqueueReadBuffer(bufferSum,CL_TRUE,0,1 * sizeof(int),&sum);
std::cout << "Sum: " << sum << "\n";

Expected sum is: 134217728.
Our test machine uses OpenCL implementation from AMD. CPU is Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz, GPU is AMD Mobility Radeon HD 5470. How much time should this code take on CPU and how much on GPU? We usually expect that operations on GPU are much faster than operations on CPU. Are they really faster? Our test returned next results:

CPU: 1.809s
GPU: 3.262s

This can be quite unexpected. Is it possible to speed up whole thing? Short answer is yes. OpenCL supports utilization of local memory (on chip) which is much faster than global memory. Let's change previous kernel AtomicSum:

kernel void AtomicSum(global int* sum){
    local int tmpSum[1];
    if(get_local_id(0)==0){
        tmpSum[0]=0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    atomic_add(&tmpSum[0],1);
    barrier(CLK_LOCAL_MEM_FENCE);
    if(get_local_id(0)==(get_local_size(0)-1)){
        atomic_add(sum,tmpSum[0]);
    }
}

This kernel does atomic add at level of work groups by utilizing local memory. At the end each work group does atomic add on global memory (last thread). This approach lovers the access to global memory. It looks promising as the results look too:

CPU: 0.815s
GPU: 0.24s

Speedup on GPU is now more that 10x. On CPU is also not so bad. Overall this is quite a nice speedup. Can we do it even faster? Let's assume that atomic operations on local memory have significant costs to. This cost can be lowered by using more local memory, where each thread tries to do atomic add at different memory locations:

kernel void AtomicSum(global int* sum){
    local int tmpSum[4];
    if(get_local_id(0)<4){
        tmpSum[get_local_id(0)]=0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    atomic_add(&tmpSum[get_global_id(0)%4],1);
    barrier(CLK_LOCAL_MEM_FENCE);
    if(get_local_id(0)==(get_local_size(0)-1)){
        atomic_add(sum,tmpSum[0]+tmpSum[1]+tmpSum[2]+tmpSum[3]);
    }
}

We got again nice speedup, but it's not four times faster than expected:

CPU: 0.858s
GPU: 0.173s

We found out that atomics cost quite some time. It's recommended to omit atomics on global memory. Atomics at local memory are better but they are always also not the best solution. This applies especially to GPUs, as they can run much more threads in parallel that CPUs. Global atomics on CPUs don't have so big impact on performance. This means that same code can run even faster on CPU than on GPU.