CUDA Instrumentation Tool README ------------------------------------ Please see the LICENSE file for licensing terms. This tool instruments CUDA kernels to detect two types of bugs at run-time: - Race conditions, in which non-deterministic behavior results from multiple unsynchronized threads writing to and possibly reading from the same memory location. - Bank conflicts, in which multiple threads access different addresses in the same physical bank in shared memory, resulting in reduced performance. The tool only instruments memory accesses to shared memory, because the ordering of memory operations among different threads can only be guaranteed for accesses to shared memory, and because only accesses to shared memory can cause bank conflicts. Note that shared memory in CUDA is different than the typical notion of shared memory in many other multi-threaded programming models. Here, shared memory refers to the software-controlled scratchpad memory accessed through variables explicitly marked __shared__ in the source code. Installation: ------------- 1) Unzip and untar the source distribution: tar xvfz cuda_instrument-0.9.tar.gz This will create a directory called cuda_instrument. 1) Enter the cuda_instrument/cil directory and run: ./configure make This will compile CIL, which is used by the tool to parse source files. 2) Enter the cuda_instrument directory and run: make This will compile the actual CUDA instrumentation tool. Usage: ------ 1) Instrument the code: ./cuda_instrument.pl The input and output file names are required. The instrumentation type is optional: use "bank" to specify bank conflict detection (the default) or "race" to specify race condition detection. Example: cuda_instrument.pl cuda_code.cu instr.cu This will process the code in cuda_code.cu, add instrumentation code to check for bank conflicts, and output the resulting code in instr.cu 2) Compile your CUDA program with the instrumented code. Two approaches: - Back up the original source file and then overwrite the original file with the instrumented file. - Back up the original source file and then copy only the instrumented CUDA kernel into the original file in place of the original kernel. The instrumentation code will only compile in emulation mode: - If you call nvcc directly on the command line, add -deviceemu to your command-line arguments to nvcc. - If you use a makefile copied/generated from one of NVIDIA's sample SDK makefiles, add emu=1 to your command-line arguments to make. - If you run CUDA in Windows through Visual Studio, choose EmuDebug or EmuRelease (in the dropdown box to the right of the run/play button in the default layout) and then build your project. 3) Run the compiled program. For each access to shared memory, the instrumented program will print the results of its analysis. For race condition detection, the program will print the type (RAW, WAR) of any race conditions found. For bank conflict detection, the program will print the number of threads accessing each of the sixteen banks as well as the number of unique addresses being accessed in each bank. The maximum of the number of unique addresses across all banks is the minimum number of accesses that must be serialized (see limitation #4 below for an explanation of why this value is only the minimum rather than an exact count). Known limitations: ------------------ 1) The tool cannot deal with shared memory accesses inside of divergent conditional code blocks, because the instrumentation code needs to insert __syncthread() calls, which leads to deadlock. For example: __shared__ int array[NUM_THREADS]; if (threadIdx.x % 2 == 0) array[threadIdx.x] = 0; // Only even numbered threads // execute this statement When this code is instrumented, compiled, and run, it will deadlock, since only the even numbered threads in the thread block will reach the call to __syncthreads() in the instrumentation code. 2) The tool only detects accesses to shared memory when they are through variables explicitly declared as __shared__. For example, the following code would not be detected as accessing shared memory: __shared__ int array[NUM_THREADS]; int *pointer = array; pointer[0] = 5; // Indirect access to shared memory 3) Bank conflicts can only occur within a single half-warp. Although different half-warps can exhibit different memory access patterns, in the interest of simplifying the tool's output, only bank conflicts within the first half-warp are reported. Here, the first half-warp is defined to be the threads with the sixteen smallest thread identifiers (using the canonical thread numbering defined in the CUDA programming guide) in the thread block with the smallest block identifier (using the canonical block numbering). Thus, for programs in which different half-warps exhibit different memory access patterns, the results reported by the tool will not hold for all half-warps. 4) Shared memory features a broadcast mechanism that allows multiple threads to access the same address in shared memory concurrently (with no bank conflicts). Unfortunately, critical details of the broadcast mechanism are left unspecified. Thus, for certain memory access patterns, it is impossible to predict with absolute certainty the number of bank conflicts that will occur. In cases where bank conflicts will definitely occur, the tool will report "Bank conflicts at expression..." In cases where bank conflicts may or may not occur depending on the broadcast behavior, the tool will report "Potential bank conflicts at expression..." See page 57 in the CUDA 1.1 Programming Guide for an explanation of the broadcast mechanism, and see Figure 5-7 on page 61 for an example of a memory access pattern in which the existence of bank conflicts depends on the unspecified behavior of the broadcast mechanism. Troubleshooting: ---------------- Although only CUDA kernels are instrumented, the tool must parse the entire source file, including any host code. If the tool is having trouble parsing your source file, and your source file contains more than just the CUDA kernel that you want to instrument, try creating a new file that contains only the CUDA kernel of interest and re-running the tool on the new file. If you find any bugs, please contact the author of the tool: Michael Boyer boyer@cs.virgina.edu