Choosing Between Pinned and Non-Pinned Memory:

When allocating CPU memory that will be used to transfer data to the GPU, there are two types of memory to choose from: pinned and non-pinned memory. Pinned (page-locked) memory is allocated with the cudaMallocHost function; it cannot be swapped out by the operating system, which allows the GPU to access it directly and yields higher transfer speeds. Non-pinned (pageable) memory is allocated with the ordinary malloc function. As described in Memory Management Overhead and Memory Transfer Overhead, pinned memory is much more expensive to allocate and deallocate but provides higher transfer throughput for large memory transfers.
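The two allocation paths can be sketched as follows (a minimal illustration, with error checking omitted and the buffer size chosen arbitrarily):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t N = 64 << 20;  // 64 MB buffer, chosen for illustration
    float *d_buf;
    cudaMalloc(&d_buf, N);

    // Non-pinned (pageable) host memory: cheap to allocate and free,
    // but transfers are staged through a driver-internal pinned buffer.
    float *h_pageable = static_cast<float *>(malloc(N));
    cudaMemcpy(d_buf, h_pageable, N, cudaMemcpyHostToDevice);
    free(h_pageable);

    // Pinned (page-locked) host memory: expensive to allocate and free,
    // but the GPU can DMA directly to and from it, giving higher throughput.
    float *h_pinned;
    cudaMallocHost(&h_pinned, N);
    cudaMemcpy(d_buf, h_pinned, N, cudaMemcpyHostToDevice);
    cudaFreeHost(h_pinned);

    cudaFree(d_buf);
    return 0;
}
```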

This raises an obvious question: how much memory needs to be transferred in order for pinned memory to provide a performance advantage?

We answer this question under two sets of assumptions. In both cases, we assume that memory is allocated and deallocated exactly once. In the first case, shown in Figure 1 below, we assume that the allocated memory is used to transfer data both to and from the GPU, and both transfers are the same size. This would be true in the case where the sizes of a kernel's input and output datasets are equal. In the second case, shown in Figure 2 below, we assume that the entire allocated memory space is transferred to the GPU, but only a single data element is transferred back from the GPU. This would be true in the case where the kernel performs a reduction, like in the computation of the average of a large set of numbers.

In the first case (the two-way transfer), we can see from Figure 1 that using pinned memory only makes sense when the amount of memory transferred each way is larger than 16 MB.


Figure 1: Time required to allocate, transfer to the GPU, transfer back to the CPU, and deallocate pinned and non-pinned memory. Note the logarithmic scales.
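A measurement of the kind plotted in Figure 1 could be sketched as below. The harness and its function name are ours, not taken from the figures: it times the full cycle of allocating host memory, copying to the device, copying back, and deallocating, for both allocator types.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time one full cycle: allocate host memory, copy host-to-device,
// copy device-to-host, deallocate. `pinned` selects the allocator.
double cycle_ms(size_t bytes, bool pinned, void *d_buf) {
    auto t0 = std::chrono::steady_clock::now();
    void *h;
    if (pinned) cudaMallocHost(&h, bytes);
    else        h = malloc(bytes);
    cudaMemcpy(d_buf, h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h, d_buf, bytes, cudaMemcpyDeviceToHost);
    if (pinned) cudaFreeHost(h);
    else        free(h);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    void *d_buf;
    cudaMalloc(&d_buf, 256 << 20);  // device buffer for the largest size
    for (size_t mb = 1; mb <= 256; mb *= 2) {
        size_t bytes = mb << 20;
        printf("%4zu MB  pageable %8.3f ms  pinned %8.3f ms\n",
               mb, cycle_ms(bytes, false, d_buf),
               cycle_ms(bytes, true, d_buf));
    }
    cudaFree(d_buf);
    return 0;
}
```

In practice each size would be measured several times and averaged, since allocation cost in particular is noisy.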


In the second case (the one-way transfer), the two lines in Figure 2 are almost indistinguishable at larger memory sizes (they would be easier to tell apart on a non-logarithmic plot).


Figure 2: Time required to allocate, transfer to the GPU, and deallocate pinned and non-pinned memory. Note the logarithmic scales.


Figure 3 shows the overall transfer speedup of pinned memory over non-pinned memory as a function of the size of each transfer. Here we can see that, in the case of the one-way transfer, using pinned memory only makes sense when transferring more than 128 MB.


Figure 3: Speedup in transfer time due to the use of pinned memory. Note the logarithmic scale on the x-axis.


Of course, the assumptions used here may not hold in any particular real-world application. For example, the allocated memory may be used for more than one transfer to the GPU. In that case, the amount of memory per transfer required for pinned memory to provide a performance advantage would be smaller than shown above. Figure 4 shows, for a given memory size, how many CPU-to-GPU transfers would be needed for pinned memory to deliver overall performance equal to or better than that of non-pinned memory. For small memory sizes, the number of transfers required is astronomical. As we would expect, the number of transfers required decreases steadily as the size of each transfer increases.
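The break-even count follows from a simple amortization argument (the symbols here are ours, introduced for illustration): let A_p and A_np be the total allocate-plus-deallocate cost for pinned and non-pinned memory, and let t_p(s) and t_np(s) be the time for a single transfer of size s with each memory type. After n transfers, pinned memory wins once its per-transfer savings cover its extra allocation cost:

    n * (t_np(s) - t_p(s)) >= A_p - A_np
    n >= (A_p - A_np) / (t_np(s) - t_p(s))

Since A_p - A_np is large and roughly fixed while t_np(s) - t_p(s) grows with s, this ratio is enormous for small transfers and falls steadily as s increases, which is the shape seen in Figure 4.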


Figure 4: Number of memory transfers required for pinned memory to provide performance equivalent to non-pinned memory. Note the logarithmic scales.