CS6501 Assignment 3: CUDA Programming

The objective of this assignment is to expose you to GPU computing and give you experience with one particular language that is growing in popularity: CUDA (the main competitor at this point in time is OpenCL). The advent of inexpensive commodity hardware (GPUs) that can be programmed simple C extensions (CUDA) allows the mass market to take advantage of data-parallel computing, and many software vendors are adding GPU acceleration using CUDA.

Due 10am, Tuesday, Oct. 18

Introduction

NVIDIA's CUDA allows programmers to use a C-like language to write programs for execution on a GPU. More information can be found in the NVIDIA GPU Computing Documentation center. You will probably find the CUDA C Programming Guide and the CUDA API Reference Manual most useful. You can also find the Occupancy Calculator here. You may also find helpful a chapter (in preprint form) that I have posted on Collab under Resources, from an upcoming book on multicore programming; this chapter gives is mainly an introduction to some key programming concepts for GPUs.

We have provided you with an example CUDA program that computes the sum of two vectors. It computes the sum on the GPU and on the CPU and then compares the two result vectors to check for correctness. It also reports performance statistics.

The actual kernel code is very simple; each thread simply sums a single element from each of the two input vectors and writes the result into the output vector.

__global__ void add_vectors_kernel(float *A, float *B, float *C, int N) {
	// Determine which element this thread is computing
	int block_id = blockIdx.x + gridDim.x * blockIdx.y;
	int thread_id = blockDim.x * block_id + threadIdx.x;
	
	// Compute a single element of the result vector (if the element is valid)
	if (thread_id < N) C[thread_id] = A[thread_id] + B[thread_id];
}

[kernel code for the vector addition example]

The code to prepare the GPU for execution and invoke the kernel is actually significantly longer than the kernel itself. It first transfers the input vectors to GPU memory, then launches the kernel, and finally transfers the result vector back to CPU memory.

// Returns the vector sum A + B (computed on the GPU)
float *GPU_add_vectors(float *A_CPU, float *B_CPU, int N) {

	// Allocate GPU memory for the inputs and the result
	int vector_size = N * sizeof(float);
	float *A_GPU, *B_GPU, *C_GPU;
	cudaMalloc((void **) &A_GPU, vector_size);
	cudaMalloc((void **) &B_GPU, vector_size);
	cudaMalloc((void **) &C_GPU, vector_size);
	
	// Transfer the input vectors to GPU memory
	cudaMemcpy(A_GPU, A_CPU, vector_size, cudaMemcpyHostToDevice);
	cudaMemcpy(B_GPU, B_CPU, vector_size, cudaMemcpyHostToDevice);
	
	// Determine the number of thread blocks in the x- and y-dimension
	int num_blocks = (int) ((float) (N + threads_per_block - 1) / (float) threads_per_block);
	int max_blocks_per_dimension = 65535;
	int num_blocks_y = (int) ((float) (num_blocks + max_blocks_per_dimension - 1) / (float) max_blocks_per_dimension);
	int num_blocks_x = (int) ((float) (num_blocks + num_blocks_y - 1) / (float) num_blocks_y);
	dim3 grid_size(num_blocks_x, num_blocks_y, 1);
	
	// Execute the kernel to compute the vector sum on the GPU
	add_vectors_kernel <<< grid_size , threads_per_block >>> (A_GPU, B_GPU, C_GPU, N);
	
	// Allocate CPU memory for the result
	float *C_CPU = (float *) malloc(vector_size);
	
	// Transfer the result from the GPU to the CPU
	cudaMemcpy(C_CPU, C_GPU, vector_size, cudaMemcpyDeviceToHost);
	
	// Free the GPU memory
	cudaFree(A_GPU);
	cudaFree(B_GPU);
	cudaFree(C_GPU);
	
	return C_CPU;
}
[host code for the vector addition example]

Once you have downloaded and uncompressed the source files, run make to compile the program. Run the program by typing ./vector_add and optionally provide the number of elements in each vector. For example, running ./vector_add 30000000 might produce the following output:

Vector generation: 2.29549 sec

GPU:      Transfer to GPU: 0.37464 sec
         Kernel execution: 0.00732 sec
        Transfer from GPU: 0.18250 sec
                    Total: 0.58041 sec

CPU: 0.19306 sec

CPU outperformed GPU by 3.01x

All values correct

[example output of vector addition program]

You may notice something interesting about the performance results. Even though the actual computation time (Kernel execution above) on the GPU is about 25 times faster than the computation time on the CPU, the overall execution time on the CPU is about 3 times faster. This is because the execution time of the GPU version is dominated by the overhead of copying data between the CPU and GPU memories. This is an important lesson for CUDA developers: it only makes sense to execute something on the GPU when there is a significant amount of computation being performed on each data element.

Problem 1 - Find the Maximum

Problem: Write a CUDA program that, given an N-element vector, finds the largest element.

Your solution should take as input N and generate a randomized vector V of length N (as in the example program described above). It should then compute the maximum value in V on the CPU and on the GPU. The program should output the two computed maximum values as well as the time taken to find each value.

The following is an example of the expected program output (with N = 100,000,000 in this case):

Vector generation: 7.29615 sec

GPU:      Transfer to GPU: 0.59613 sec
         Kernel execution: 0.06669 sec
        Transfer from GPU: 0.00002 sec
                    Total: 0.68033 sec

CPU: 0.33637 sec

CPU outperformed GPU by 2.02x

GPU max: 224110944.000000
CPU max: 224110944.000000

[example output of maximum finding program]

Your goal is to find the maximum using CUDA as efficiently as you can. For the purposes of this assignment, we will ignore the overhead of copying the vector to the GPU. Thus, your goal is to minimize the rest of the execution time of the GPU solution. In other words, you should try to minimize the sum of the Kernel execution and Transfer from GPU times.

Hints:

Make a copy of the vector_add source directory and modify this as a template for your new program. (Don't forget to give the files suitable names.)
Remember that threads in different thread blocks cannot directly communicate. Start by assuming that the total number of elements in the vector is less than the number of threads in one thread block. After you have a correct solution for this case, try to extend it to support arbitrary vector sizes.
When threads within the same thread block need to communicate, use shared memory rather than global memory. Accessing shared memory is orders of magnitude faster than accessing global memory. (You may be interested to compare the performance benefit of shared memory vs. only using global memory.) Remember also to synchronize all of the threads within a thread block using the __syncthreads() function where necessary.
FLT_MIN (defined in float.h) is the smallest possible single-precision floating point number. This can be useful when you are trying to find the maximum value across an entire thread block but not every thread has a valid value. Similarly, FLT_MAX is the largest possible single-precision floating point number.

Problem 2 - Max, Min, Mean, and Standard Deviation

Problem: Extend your solution to the previous problem so that it efficiently finds and reports the following statistics:

The maximum element in the vector
The minimum element in the vector
The arithmetic mean of the vector
The standard deviation of the values in the vector

The following is an example of the expected program output (with N = 100,000,000 in this case):

Vector generation: 7.28857 sec

GPU:      Transfer to GPU: 0.53311 sec
         Kernel execution: 0.08253 sec
        Transfer from GPU: 0.00006 sec
                    Total: 0.63385 sec

CPU: 0.62369 sec

CPU outperformed GPU by 1.02x

GPU max: 224110944.000000
CPU max: 224110944.000000

GPU min: 0.000000
CPU min: 0.000000

GPU mean: 224110944.000000
CPU mean: 224110944.000000

GPU sigma: 168.314159
CPU sigma: 168.314159

[example output of maximum, minimum, mean, and
standard deviation finding program]

Notice that, because the amount of computation per data element has increased, the CPU and GPU implementations are closer in performance.

Hints:

Rather than creating one kernel for each of the statistics, try to compute as many of the statistics as possible in a single kernel. Computing the statistics concurrently can significantly improve performance, because the overhead of launching a kernel is non-negligible. For extra credit, do it both ways (separately and together) and measure the difference. For further extra credit, compare the results using single- vs. double-precision.
Recall that, even though the addition of real numbers is commutative and associative, in general floating-point addition is not. Thus, it is possible (and even likely) that computing the mean on the GPU will produce a slightly or even significantly different result than on the CPU. In fact, it is likely that the mean value computed by the GPU will be significantly more accurate for very large vector sizes. (These precision issues are one reason why debugging parallel programs is hard!)

Submission Instructions

You will work on any of {barracuda12,barracuda13,tesla}.cs.virginia.edu. If one is down, use another. I suggest to access the machines remotely via ssh or similar. Some of the cray nodes also have suitable GPUs, but these require pbs -i for access. Note that, to get accurate timings, you will need to make sure no one else is using the GPU when you do.

You will need to get a CS departmental account.

Important: You will need to make sure your shell knows which paths to look in libraries. Make sure your LD_LIBRARY_PATH includes /usr/local/cuda/lib -- depending on your shell, the syntax for doing this will vary. For bash users, one way to do this is:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib

But you have to do this everytime you log in. You may want to add /usr/local/cuda/lib to your .profile or .bashrc. If you are encountering problems with other libraries, the error message usually tells you which paths are missing. You can also run
ldd [executable_file]
to see which libraries your executable uses and which ones it cannot find.

To submit this assignment, tar up all your code (Makefile and *.cu) as well as your README file, and upload this to Collab as part of your assignment submission. Please *separately* attach a brief report documenting your findings, and briefly explain the algorithms you implemented and why you chose them. You may also want to discuss the sequence of optimization steps you went through to arrive at your final algorithm.
The easiest way to get your tar file from barracuda machine is to use a secure ftp program like SecureFX or sftp, or else scp.