CUDA Kernel Overhead:

Launching a kernel on the GPU is a relatively expensive operation. By default, kernel calls are asynchronous: control returns to the CPU, and the program can proceed to the next instruction, before the kernel has finished executing. In practice, however, many CUDA programs launch a kernel and then immediately copy the result back to the host. The blocking cudaMemcpy call forces the CPU to wait until the kernel has completed, which effectively makes the kernel call a synchronous operation.

To quantify the overhead of a kernel call, we measured the average time required to launch an empty kernel over a large number of invocations. Figure 1 shows the time per kernel call on four different GPUs across three different machines. The asynchronous bar represents the case where the kernel was invoked repeatedly with no synchronization between calls. The synchronous bar represents the case where the cudaThreadSynchronize function was called after each kernel call, forcing the CPU to wait for the current kernel to finish before launching the next one.
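The measurement loop described above can be sketched roughly as follows. This is a minimal C++ timing harness, not the original benchmark code: the name average_seconds_per_call and its parameters are hypothetical, and the CUDA calls appear only in comments so that the sketch compiles without the CUDA toolkit. It also assumes one final synchronization before the clock stops, which the text does not specify.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Average wall-clock seconds per call of `launch` over `iterations` calls.
// Assumption: in a CUDA build, `launch` would wrap an empty kernel launch
// such as `empty_kernel<<<1, 1>>>()` and `synchronize` would wrap
// cudaThreadSynchronize() (cudaDeviceSynchronize() in newer CUDA releases).
// Both callables are stand-ins chosen for illustration.
template <typename Launch, typename Synchronize>
double average_seconds_per_call(Launch launch, Synchronize synchronize,
                                std::size_t iterations, bool sync_each_call) {
    assert(iterations > 0);
    synchronize();  // drain earlier work so it is not charged to the first call
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        launch();
        if (sync_each_call) {
            synchronize();  // synchronous case: wait after every call
        }
    }
    // Assumed design choice: wait once more so that every queued call has
    // finished before the clock stops (redundant in the synchronous case).
    synchronize();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count() /
           static_cast<double>(iterations);
}
```

Calling the harness once with sync_each_call set to false and once with it set to true would give the two quantities plotted as the asynchronous and synchronous bars. Averaging over many iterations matters because a single launch is too short to time reliably with a host clock.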

The time per asynchronous kernel call represents a fundamental cost, most likely the cost of accessing the GPU driver. However, if a program has enough work to perform on the CPU while the kernel executes, it should be able to hide the remaining latency of the kernel call. For the empty kernel measured here, that remainder is about two-thirds of the overall kernel latency. For real kernels, the hideable fraction would be even larger, because the latency of accessing the driver is presumably independent of the characteristics of the kernel being launched.
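The arithmetic behind this argument can be written out as a small helper. The function name hideable_fraction is hypothetical, and the sketch assumes, as the paragraph does, that the asynchronous time is a fixed launch cost that cannot be overlapped with CPU work.

```cpp
#include <cassert>

// Fraction of a synchronous call's latency that CPU work could overlap,
// treating the asynchronous time as launch overhead that cannot be hidden.
// Both arguments must use the same time unit.
double hideable_fraction(double async_time, double sync_time) {
    assert(sync_time > 0.0);
    assert(async_time >= 0.0 && async_time <= sync_time);
    return (sync_time - async_time) / sync_time;
}
```

With illustrative values rather than measured ones, hideable_fraction(1.0, 3.0) gives two-thirds, matching the empty-kernel case. Keeping the launch cost at 1.0 but lengthening the total to 30.0 gives 29/30, or roughly 0.97, which is why a longer-running kernel leaves a larger share of its latency available for overlap.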

Figure 1: Time per empty kernel call.

Figure 2 shows the kernel call throughput achievable by each hardware configuration, measured in thousands of kernel calls per second. This is essentially the reciprocal of the data in Figure 1. The highest kernel throughput achieved is about 334,000 calls per second (on Barracuda10). As a point of comparison, the 3.20 GHz Core 2 Extreme processor in the same machine can call an empty CPU function approximately 300,000,000 times per second, roughly three orders of magnitude more.
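A rough sketch of the two calculations in this paragraph follows: converting time per call into throughput, and measuring how often an empty CPU function can be called. The function names are hypothetical and the text does not say how the CPU figure was obtained. This version assumes the calls go through a volatile function pointer so that the compiler cannot inline or remove them, which makes each call indirect and possibly slightly slower than a direct call.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Throughput is the reciprocal of the average time per call.
double calls_per_second(double seconds_per_call) {
    assert(seconds_per_call > 0.0);
    return 1.0 / seconds_per_call;
}

void empty_function() {}

// Calls an empty function `iterations` times and returns calls per second.
// The volatile pointer forces a real (indirect) call on every iteration.
double measure_empty_cpu_calls_per_second(std::size_t iterations) {
    assert(iterations > 0);
    void (*volatile call)() = empty_function;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        call();
    }
    const auto stop = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(stop - start).count();
    return static_cast<double>(iterations) / seconds;
}
```

Using the figures quoted above, 334,000 kernel calls per second corresponds to about 3 microseconds per call, and 300,000,000 / 334,000 is roughly 900, which is where the "three orders of magnitude" comparison comes from. The result of measure_empty_cpu_calls_per_second depends on the machine and compiler settings, so it should be run with a large iteration count (tens of millions) and read as an estimate.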

Figure 2: Number of empty kernel calls per second.