CUDA Kernel Overhead:
Executing a kernel on the GPU is a relatively expensive operation.
By default, kernel calls are asynchronous,
meaning that a CUDA program can proceed to the next instruction
before the kernel actually completes execution. In practice, however,
many CUDA programs launch a kernel and then immediately copy the
result back to the host. A blocking cudaMemcpy call forces the CPU to
wait until kernel execution has completed, effectively making the
kernel call a synchronous operation.
To quantify this overhead, we measured the average time required to
launch an empty kernel over a large number of invocations. Figure 1
shows the time per kernel call on four different GPUs across three
different machines.
The asynchronous bar represents the case where the kernel was invoked
repeatedly with no synchronization between calls. The synchronous bar
represents the case where the cudaThreadSynchronize function was called
after each kernel call, forcing the CPU to wait for the current kernel
to finish execution before calling the next kernel.
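The two measurements described above can be sketched as follows. This is a minimal illustration, not the original benchmark: the launch count, the single-thread launch configuration, the warm-up launch, and the final synchronization in the asynchronous case are all assumptions, and the helper name usPerLaunch is hypothetical. The sketch uses cudaDeviceSynchronize, which replaces the deprecated cudaThreadSynchronize in newer CUDA toolkits.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Empty kernel: any time measured is call overhead, not useful work.
__global__ void emptyKernel() {}

// Average wall-clock time per kernel call, in microseconds.
// With syncEachCall set, the CPU waits for each kernel before launching the next.
static double usPerLaunch(int launches, bool syncEachCall) {
    cudaDeviceSynchronize();  // start from an idle device
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i) {
        emptyKernel<<<1, 1>>>();
        if (syncEachCall) cudaDeviceSynchronize();
    }
    cudaDeviceSynchronize();  // assumption: drain the queue so every launch is counted
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count() / launches;
}

int main() {
    const int launches = 100000;  // assumed; the text says only "a large number"

    emptyKernel<<<1, 1>>>();      // warm-up launch absorbs one-time context setup
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("asynchronous: %.2f us per call\n", usPerLaunch(launches, false));
    std::printf("synchronous:  %.2f us per call\n", usPerLaunch(launches, true));
    return 0;
}
```

Compiled with nvcc, the program prints one average for each case. The absolute numbers depend on the GPU, driver, and host, so they should not be expected to match Figure 1.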
The time per asynchronous kernel call represents a fundamental cost,
most likely that of accessing the GPU driver. The remainder of the
synchronous call time is spent waiting for the kernel to complete, and
a program with enough work to perform on the CPU while the kernel
executes should be able to hide it. For the empty kernel, this hideable
portion is about two-thirds of the overall kernel latency. For real
kernels, it would be an even larger fraction, because the cost of
accessing the driver is presumably independent of the characteristics
of the kernel being launched.
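The latency-hiding pattern described above might look like the following sketch. The kernel scale and the function cpuWork are hypothetical stand-ins chosen for illustration, as are the array size and iteration count. The point is only the ordering: launch the kernel, do independent CPU work, and synchronize (here through a blocking cudaMemcpy) only when the result is needed.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies every element by a constant factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical stand-in for CPU work that does not depend on the kernel's result.
static double cpuWork(int iterations) {
    double sum = 0.0;
    for (int i = 1; i <= iterations; ++i) sum += 1.0 / i;
    return sum;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n, 1.0f);

    float *dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, bytes);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    // The launch returns as soon as the kernel has been queued.
    scale<<<(n + 255) / 256, 256>>>(dev, n, 2.0f);

    // This CPU work overlaps with kernel execution on the GPU.
    double partial = cpuWork(1000000);

    // The blocking copy waits for the kernel; the CPU stalls only here.
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);

    std::printf("host[0] = %.1f, cpu result = %.4f\n", host[0], partial);
    cudaFree(dev);
    return 0;
}
```

If the CPU work takes at least as long as the kernel, the final copy finds the kernel already finished, and the program pays only the asynchronous launch cost plus the transfer.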
Figure 1: Time per empty kernel call.
Figure 2 shows the kernel call throughput achievable by each hardware
configuration, measured in thousands of kernel calls per second.
Note that this is essentially the inverse of Figure 1. The highest
kernel throughput achieved is about 334,000 calls per second (on
Barracuda10). As a point of comparison, a 3.20 GHz Core 2 Extreme
processor (also on Barracuda10) can call an empty CPU function
approximately 300,000,000 times per second, about three orders of
magnitude faster.
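As a quick check on these figures, the throughput numbers can be converted to per-call times and compared directly. The calculation below uses only the two rates quoted above. The symbols t_kernel and t_CPU are introduced here for illustration.

```latex
\begin{align*}
t_{\text{kernel}} &= \frac{1}{334{,}000\ \text{calls/s}} \approx 2.99\ \mu\text{s per call} \\
t_{\text{CPU}} &= \frac{1}{300{,}000{,}000\ \text{calls/s}} \approx 3.3\ \text{ns per call} \\
\frac{t_{\text{kernel}}}{t_{\text{CPU}}} &= \frac{300{,}000{,}000}{334{,}000} \approx 898 \approx 10^{2.95}
\end{align*}
```

At 3.20 GHz, 3.3 ns corresponds to roughly 11 clock cycles per empty function call, while the best-case kernel call costs about 3 microseconds.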
Figure 2: Number of empty kernel calls per second.