CUDA Support/Enabling double-precision
Problem
CUDA by default does not support double-precision floating point arithmetic, and the CUDA compiler silently converts doubles into floats inside of kernels.
Solution
If you are working on a machine with a GPU that supports double-precision, add "--gpu-name sm_13" to the command line options passed to nvcc.
More Information
Early CUDA-enabled GPUs did not provide double-precision support. The most recent GPUs, such as the GTX 260 and GTX 280, do support double-precision. However, by default the CUDA compiler does not use double-precision arithmetic.
There are a few approaches to addressing this problem:
- Use single-precision floating point numbers exclusively in any CUDA-related code; this may be acceptable for many applications without strict precision requirements
- Use double-precision floating point numbers and add the following flag when calling nvcc: "--gpu-name sm_13". This flag tells nvcc that the GPU that the code will execute on is new enough to provide double-precision support. Note that code compiled in this manner will not execute on an older GPU. If you are compiling your CUDA files through MATLAB, you need to add the --gpu-name flag shown above to COMPFLAGS in nvmexopts.bat.
- Define your own datatype (which we call Real) as follows: "#define Real float" or "#define Real double". By using "Real" throughout the application instead of "float" or "double", you can easily switch between single- and double-precision by simply changing the definition of Real.
Explanation
As stated above, by default the CUDA compiler silently converts doubles into floats inside of kernels. To demonstrate the problems that may arise because of this, consider the following simple kernel:
__global__ void kernel(double *input_array, double *output_array) {
output_array[threadIdx.x] = sqrt(input_array[threadIdx.x]);
}
When this is compiled by nvcc, it is (silently) converted to the following:
__global__ void kernel(float *input_array, float *output_array) {
output_array[threadIdx.x] = input_array[threadIdx.x];
}
By itself, this change might not be a significant issue. The real problem is that the host (CPU) code is not changed in the same manner. Most likely the kernel above would be invoked by code similar to the following:
// Allocate memory on the GPU
double *device_input, *device_output;
int mem_size = N * sizeof(double);
cudaMalloc( (void **) &device_input, mem_size);
cudaMalloc( (void **) &device_output, mem_size);
// Copy the input data to the GPU
cudaMemcpy(device_input, host_input, mem_size, cudaMemcpyHostToDevice);
// Launch the kernel
kernel <<< N, 1 >>> (device_input, device_output);
// Copy back the results from the GPU
cudaMemcpy(host_output, device_output, mem_size, cudaMemcpyDeviceToHost);
Since this host code is not changed from double- to single-precision, when this program is compiled and executed the following will happen:
- N double-precision floating point numbers will be copied to the GPU
- Each half of the first N/2 double-precision numbers will be read by the kernel as a single-precision number, which is obviously incorrect
- N double-precision numbers will be copied back to the CPU. The first N/2 numbers will each be the concatenation of the square roots of two of the incorrectly read values, while the second N/2 numbers will be uninitialized.
Performance
On the GTX 280 & 260, a multiprocessor has eight single-precision floating point ALUs (one per core) but only one double-precision ALU (shared by the eight cores). Thus, for applications whose execution time is dominated by floating point computations, switching from single-precision to double-precision will increase runtime by a factor of approximately eight. For applications which are memory bound, enabling double-precision will only decrease performance by a factor of about two.
Back to CUDA Support