This site contains OpenCL notes, tutorials, benchmarks, news.

Sunday, September 29, 2013

Dynamic parallelism in OpenCL 2.0

The provisional OpenCL 2.0 specification was released a few months ago. One of its most interesting features is support for dynamic parallelism. In the CUDA world it has existed for about a year, but still only on the most expensive devices with compute capability 3.5 (Titan, GTX 780; both with the GK110 chip). On the AMD side the story is a little different. AMD has not said anything about dynamic parallelism, but it has introduced GCN 2.0, which might support it. In addition, AMD introduced Mantle, a new GPU API that promises up to 9 times more draw calls than comparable APIs (OpenGL, DirectX). This hints that draw calls might be issued from the GPU itself.

How will dynamic parallelism be used? Very simply. Kernels will enqueue kernels to a device queue:
int enqueue_kernel (
    queue_t queue,
    kernel_enqueue_flags_t flags,
    const ndrange_t ndrange,
    void (^my_block_A)(void));
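The first argument is the queue. It must be a device-side queue: either the default one returned by get_default_queue(), or one that the host created with the CL_QUEUE_ON_DEVICE property and passed to the kernel as an argument.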

Take care, as this function is asynchronous. The parent kernel does not wait for its child kernels; whether a child waits for the parent depends on the second argument of enqueue_kernel. The child kernel can start running while the parent kernel is still running (CLK_ENQUEUE_FLAGS_NO_WAIT), wait for the parent kernel to finish (CLK_ENQUEUE_FLAGS_WAIT_KERNEL), or wait only for the enqueuing work-group of the parent kernel to finish (CLK_ENQUEUE_FLAGS_WAIT_WORK_GROUP).

The third argument defines the number of work-items to run. For example, ndrange_1D(global_work_size, local_work_size) can be used.

The last argument defines the actual function to run as a kernel. It is written using the Block syntax:
void (^my_block_A)(void) = ^{
    size_t id = get_global_id(0);
    b[id] += a[id];
};
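Putting the four arguments together, a parent kernel might look like the sketch below. It is only an illustration under a few assumptions: the kernel name parent_add and the argument n are hypothetical, the program is built with -cl-std=CL2.0, and the host has already created a default on-device queue so that get_default_queue() returns something usable.

```c
// OpenCL C 2.0 device code (build with -cl-std=CL2.0).
// parent_add and n are hypothetical names used only for illustration.
kernel void parent_add(global const float *a, global float *b, uint n)
{
    // Let a single work-item launch the child grid; otherwise every
    // parent work-item would enqueue its own copy of the child kernel.
    if (get_global_id(0) != 0)
        return;

    // Child kernel body. The pointers a and b are captured from the
    // enclosing kernel; get_global_id(0) refers to the child's own index.
    void (^my_block_A)(void) = ^{
        size_t id = get_global_id(0);
        b[id] += a[id];
    };

    // Assumes the host created a default device queue beforehand.
    queue_t queue = get_default_queue();

    // n child work-items; the local size is left to the implementation.
    ndrange_t ndrange = ndrange_1D(n);

    // The child starts only after the parent kernel has finished.
    // The return value is ignored here to keep the sketch short.
    enqueue_kernel(queue, CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange, my_block_A);
}
```

CLK_ENQUEUE_FLAGS_WAIT_KERNEL is the conservative choice in this sketch; CLK_ENQUEUE_FLAGS_NO_WAIT would let the child overlap with the parent, which is only safe if the parent does not touch b afterwards.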
Using enqueue_kernel seems quite easy, but there are some possible problems. How many kernels can we launch this way? The device queue can run out of memory. If that happens, enqueue_kernel returns CLK_ENQUEUE_FAILURE (or the more specific CLK_DEVICE_QUEUE_FULL when the program is built with the -g debug option). So it seems that OpenCL code will require a lot of error handling, which is not ideal for SIMD machinery.
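One possible way to handle a failed enqueue is sketched below. This is an assumption about how such code could be structured, not a recommended pattern: the kernel name parent_checked and the status argument are hypothetical, and the serial fallback is deliberately naive. The check compares against CLK_SUCCESS, so it catches any failure code.

```c
// OpenCL C 2.0 device code (build with -cl-std=CL2.0).
// parent_checked and status are hypothetical names for illustration.
kernel void parent_checked(global const float *a, global float *b,
                           uint n, global int *status)
{
    // Only one work-item enqueues the child and handles the result.
    if (get_global_id(0) != 0)
        return;

    int err = enqueue_kernel(get_default_queue(),
                             CLK_ENQUEUE_FLAGS_NO_WAIT,
                             ndrange_1D(n),
                             ^{
                                 size_t id = get_global_id(0);
                                 b[id] += a[id];
                             });

    if (err != CLK_SUCCESS) {
        // The child was never enqueued (for example, the device queue
        // was full), so do the same work serially in this work-item.
        for (uint i = 0; i < n; ++i)
            b[i] += a[i];
    }

    // Report the result so the host can see whether the fallback ran.
    *status = err;
}
```

The branch runs in a single work-item, so it does not cause divergence inside a wavefront here, but a kernel in which many work-items enqueue children would need such a check in each of them.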

1 comment:

  1. I suspect that this will only work on GPUs that are compatible with dynamic parallelism at the hardware level. Is this correct? If so, do you know which AMD GPUs will support this, if any?
