Memory Allocation Overhead:

Allocating CPU memory: Allocating memory to be accessed by the CPU is most often accomplished using the C standard library function malloc. An alternative is CUDA's cudaMallocHost function, which allocates page-locked (pinned) memory and thereby achieves higher transfer throughput between CPU and GPU memory. The increase in throughput using memory allocated by cudaMallocHost is about 2.4x on Barracuda10, 2.0x on Barracuda04, and 1.5x on Barracuda01. See Memory Transfer Overhead for more details.
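The two host allocation paths can be sketched as follows; this is a minimal illustration, and the buffer size and variable names are illustrative, not taken from the measurements:

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>

// Sketch: pageable vs. page-locked (pinned) host allocation.
// The 64 MB size is illustrative, not taken from the measurements.
int main(void) {
    const size_t bytes = 64u << 20;  // 64 MB

    // Pageable host memory from the C standard library.
    float *pageable = (float *)malloc(bytes);

    // Pinned host memory from the CUDA runtime; copies between this
    // buffer and the GPU reach higher throughput than pageable memory.
    float *pinned = NULL;
    cudaMallocHost((void **)&pinned, bytes);

    /* ... use either buffer as the source/destination of cudaMemcpy ... */

    free(pageable);
    cudaFreeHost(pinned);
    return 0;
}
```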

Figure 1 below shows the amount of time taken by each function per allocation as the amount of memory allocated per request increases, as measured on Barracuda10. The time per call to malloc is less than 0.1 microseconds for small allocations. It increases steadily between 512 B and 4 KB, above which it remains relatively constant at about 1 to 2 microseconds.

Unfortunately, allocating memory using cudaMallocHost is significantly more expensive than using malloc. The time per call to cudaMallocHost remains relatively constant at approximately 2,300 microseconds per call until around 1 MB, where it starts increasing. At the largest size measured, 512 MB, cudaMallocHost takes 61 milliseconds per call. Overall, cudaMallocHost is anywhere from three to five orders of magnitude slower than malloc.

Allocating GPU memory: Allocating memory to be accessed by the GPU is most often accomplished using the CUDA function cudaMalloc. There are alternatives, such as cudaMallocPitch and cudaMallocArray, but they are not explored here; we would not expect them to be any faster than cudaMalloc. Figure 1 shows the time taken by cudaMalloc per allocation. The time per call is relatively constant at slightly less than 1 microsecond for memory sizes less than 256 bytes. There is a significant increase between 2 KB and 4 KB, and then a relatively constant overhead of about 50 microseconds up to 512 KB. Above 512 KB, the overhead increases significantly, up to 12.5 milliseconds at 512 MB. Overall, cudaMalloc is about one and a half orders of magnitude slower than malloc for memory sizes less than 4 MB. Above 4 MB, cudaMalloc is two to four orders of magnitude slower than malloc.
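For reference, a minimal cudaMalloc sketch looks like the following; the 1 MB size and variable names are illustrative:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Minimal sketch of device-memory allocation with cudaMalloc.
// The 1 MB size is illustrative, not taken from the measurements.
int main(void) {
    const size_t bytes = 1u << 20;  // 1 MB

    float *d_buf = NULL;
    cudaError_t err = cudaMalloc((void **)&d_buf, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* ... launch kernels that read or write d_buf ... */

    cudaFree(d_buf);  // device memory is released with cudaFree
    return 0;
}
```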


Figure 1: Time per allocation as a function of the number of bytes allocated per call. Note the logarithmic scales.


As expected, Figure 2 shows that the allocation time per byte decreases as the number of bytes per call increases.


Figure 2: Allocation time per byte as a function of the number of bytes allocated per call. Note the logarithmic scales.


Memory Deallocation Overhead:

Each memory allocation function has an equivalent deallocation function: free for malloc, cudaFree for cudaMalloc, and cudaFreeHost for cudaMallocHost. Figure 3 below shows the time per call to each deallocation function as the amount of memory freed per call increases. The trends are similar to those in Figure 1. Overall, cudaFreeHost is about three to five orders of magnitude slower than free, while cudaFree is about two to three orders of magnitude slower than free.
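The allocator/deallocator pairing can be made concrete with a short sketch; sizes and names are illustrative:

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>

// Each allocator has exactly one matching deallocator; mixing them
// (e.g. calling free on a cudaMalloc'd pointer) is undefined behavior.
int main(void) {
    const size_t n = 1u << 20;  // 1 MB, illustrative

    char *pageable = (char *)malloc(n);       // CPU, pageable
    char *pinned = NULL;
    cudaMallocHost((void **)&pinned, n);      // CPU, page-locked
    char *device = NULL;
    cudaMalloc((void **)&device, n);          // GPU

    free(pageable);        // pairs with malloc
    cudaFreeHost(pinned);  // pairs with cudaMallocHost
    cudaFree(device);      // pairs with cudaMalloc
    return 0;
}
```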


Figure 3: Time per deallocation as a function of the number of bytes deallocated per call. Note the logarithmic scales.


Figure 4 shows that, as we would expect, the deallocation time per byte decreases as the number of bytes per call increases.


Figure 4: Deallocation time per byte as a function of the number of bytes freed per call. Note the logarithmic scales.