Rodinia 2.3 changelog

  • Rodinia 2.3 Change log
    • A. General
   Add -lOpenCL in the OPENCL_LIB definition in common/make.config
   OPENCL_LIB = $(OPENCL_DIR)/OpenCL/common/lib -lOpenCL (gcc-4.6+ compatible) 
    • B. OpenCL
   1. Particlefilter OpenCL
   a) Runtime work group size selection based on device limits
   b) Fix several kernel bugs
   c) Initialize all arrays on both the host side and the device side
   d) Fix out-of-bounds access to the objxy_GPU array on the device:
   objxy_GPU = clCreateBuffer(context, CL_MEM_READ_WRITE, 2*sizeof (int) *countOnes, NULL, &err);
   and 
   err = clEnqueueWriteBuffer(cmd_queue, objxy_GPU, 1, 0, 2*sizeof (int) *countOnes, objxy, 0, 0, 0);
   Thanks to Maxim Perminov from Intel.
   e) #define PI 3.1415926535897932 in ex_particle_OCL_naive_seq.cpp
   f) Put -lOpenCL right after -L$(OPENCL_LIB) in the Makefile.
   g) Delete the unused function tex1Dfetch() from particle_float.cl.
   h) Add a single-precision version.
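   Item 1a above (runtime work-group size selection) can be sketched roughly as follows. This is a minimal illustration, not the Rodinia code: the helper names pick_work_group_size and round_up_global_size are hypothetical, and the device limit is assumed to have been queried beforehand (for example with clGetDeviceInfo and CL_DEVICE_MAX_WORK_GROUP_SIZE).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: clamp the preferred work-group size to the limit
// reported by the device. The limit is assumed to be queried elsewhere,
// e.g. via clGetDeviceInfo(..., CL_DEVICE_MAX_WORK_GROUP_SIZE, ...).
std::size_t pick_work_group_size(std::size_t preferred, std::size_t device_max) {
    assert(preferred > 0 && device_max > 0);
    return std::min(preferred, device_max);
}

// Hypothetical helper: round the number of work items up to a multiple of
// the local size. The kernel then has to ignore the padding work items.
std::size_t round_up_global_size(std::size_t work_items, std::size_t local_size) {
    assert(local_size > 0);
    return ((work_items + local_size - 1) / local_size) * local_size;
}
```

   With a preferred size of 512 and a device limit of 256, the sketch picks 256, which is the same value the Heartwall and BFS entries below settle on for AMD GPUs.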
   2. B+Tree OpenCL. Thanks to Alexey Kravets and Elena Stohr from the CARP project.
   a) Replace the CUDA function __syncthreads() with the OpenCL barrier(CLK_LOCAL_MEM_FENCE) in the kernel file
   3. Heartwall OpenCL
   a) Lower the work-group size from 512 to 256 (better compatibility with AMD GPUs)
   b) Fix several bugs in the kernel code
   c) Fix several bugs in the host code
   4. BFS OpenCL. Thanks to Daniel Lustig from Princeton and Elena Stohr from the CARP project.
   a) Replace all bool with char, since bool is not a valid type for OpenCL kernel arguments.
   b) Put -lOpenCL right after -L$(OPENCL_LIB) in the Makefile. (gcc-4.6+ compatible)
   c) Remove NVIDIA-specific parameters and decrease the thread block size for better compatibility with AMD GPUs:
   BFS/CLHelper.h: 240
   -std::string options= "-cl-nv-verbose";
   +//std::string options= "-cl-nv-verbose"; // doesn't work on AMD machines
   - resultCL = clBuildProgram(oclHandles.program, deviceListSize, oclHandles.devices, options.c_str(), NULL, NULL);
   + resultCL = clBuildProgram(oclHandles.program, deviceListSize, oclHandles.devices, NULL, NULL, NULL);
   bfs.cpp:
   -#define MAX_THREADS_PER_BLOCK 512
   +#define MAX_THREADS_PER_BLOCK 256 // 512 is too big for my AMD Fusion GPU
   d) Correct bad mallocs
   BFS/CLHelper.h
   204:
   -oclHandles.devices = (cl_device_id *)malloc(deviceListSize);
   +oclHandles.devices = (cl_device_id *)malloc(deviceListSize * sizeof(cl_device_id));
   414:
   - d_mem = clCreateBuffer(oclHandles.context, CL_MEM_READ_WRITE, size, h_mem_ptr, &oclHandles.cl_status);
   + d_mem = clCreateBuffer(oclHandles.context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, size, h_mem_ptr, &oclHandles.cl_status);
   428:
   - d_mem = clCreateBuffer(oclHandles.context, CL_MEM_READ_ONLY, size, h_mem_ptr, &oclHandles.cl_status);
   +d_mem = clCreateBuffer(oclHandles.context, CL_MEM_WRITE_ONLY | CL_MEM_COPY_HOST_PTR, size, h_mem_ptr, &oclHandles.cl_status);
   466:
   h_mem_pinned = (cl_float *)clEnqueueMapBuffer(oclHandles.queue, d_mem_pinned, CL_TRUE, \
   - CL_MAP_WRITE, 0, size, NULL, NULL, \
   + CL_MAP_WRITE, 0, size, 0, NULL, \
   bfs.cpp
   80: // fix RO vs RW
   -d_graph_mask = _clMalloc(no_of_nodes*sizeof(bool), h_graph_mask);
   -d_updating_graph_mask = _clMalloc(no_of_nodes*sizeof(bool), h_updating_graph_mask);
   - d_graph_visited = _clMalloc(no_of_nodes*sizeof(bool), h_graph_visited);
   + d_graph_mask = _clMallocRW(no_of_nodes*sizeof(bool), h_graph_mask);
   + d_updating_graph_mask = _clMallocRW(no_of_nodes*sizeof(bool), h_updating_graph_mask);
   + d_graph_visited = _clMallocRW(no_of_nodes*sizeof(bool), h_graph_visited);
   274:
   - compare_results<int>(h_cost, h_cost_ref, no_of_nodes);
   + compare_results<int>(h_cost_ref, h_cost, no_of_nodes);
   g) Add #include <cstdlib> in bfs.cpp
   h) Include timer.h only when PROFILING is defined:
   #ifdef  PROFILING
   #include "timer.h"
   #endif
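   The bool-to-char change in item 4a can be illustrated on the host side. The sketch below is an assumption about how such a conversion might look, not code from bfs.cpp: to_char_mask, mask_bytes and set_mask are hypothetical helpers. The point is that char has a fixed size of one byte on both host and device, whereas the size of bool is implementation-defined.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy a host-side bool mask into a char buffer that
// can be uploaded to the device (1 = set, 0 = clear).
std::vector<char> to_char_mask(const std::vector<bool>& mask) {
    std::vector<char> out(mask.size());
    for (std::size_t i = 0; i < mask.size(); ++i) {
        out[i] = mask[i] ? 1 : 0;
    }
    return out;
}

// Byte size of a device buffer holding n mask entries stored as char.
std::size_t mask_bytes(std::size_t n) {
    return n * sizeof(char);
}

// Hypothetical helper: mark node i in a char mask.
void set_mask(std::vector<char>& mask, std::size_t i) {
    assert(i < mask.size());
    mask[i] = 1;
}
```

   A kernel written against such a buffer would take a pointer to char and test each entry against zero.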
   5. CFD OpenCL
   a) Comment out two unnecessary clWaitForEvents calls in CLHelper.h. This gives a 1.5X speedup on some GPUs.
   b) Put -lOpenCL right after -L$(OPENCL_LIB) in the Makefile. (gcc-4.6+ compatible)
   c) In cfd/CLHelper.h, change
   oclHandles.devices = (cl_device_id *)malloc(deviceListSize);
   to
   oclHandles.devices = (cl_device_id *)malloc(sizeof(cl_device_id) * deviceListSize);
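   The malloc correction above (also recorded for BFS and streamcluster) comes down to passing a byte count rather than an element count. A minimal sketch, assuming deviceListSize is a number of devices; DeviceId is a stand-in for cl_device_id so that the example builds without the OpenCL headers, and both helper names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Stand-in for cl_device_id (assumption: an opaque pointer type, as in the
// OpenCL headers), so the sketch compiles without OpenCL installed.
typedef struct FakeDevice* DeviceId;

// Bytes needed for `count` handles: element count times element size.
// Passing `count` alone to malloc would reserve only `count` bytes.
std::size_t device_list_bytes(std::size_t count) {
    return count * sizeof(DeviceId);
}

// Allocate room for `count` device handles; release with std::free.
DeviceId* alloc_device_list(std::size_t count) {
    assert(count > 0);
    return static_cast<DeviceId*>(std::malloc(device_list_bytes(count)));
}
```

   With four devices and 8-byte handles, the uncorrected call would reserve 4 bytes where 32 are needed, so writing the device list would overrun the allocation.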
   6. Backprop OpenCL. Thanks to Alexey Kravets and Elena Stohr from the CARP project.
   a) OpenCL does not support integer log2 and pow. Change
   backprop_kernel.cl, lines 40 and 42:
   for ( int i = 1 ; i <= log2(HEIGHT) ; i++){                                           
   int power_two = pow(2, i);  
   To:
   for ( int i = 1 ; i <= HEIGHT ; i=i*2){                                      
   int power_two = i;
   b) Change
   if( device_list ) delete device_list;
   to
   if( device_list ) delete[] device_list;
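   The loop rewrite in item 6a swaps floating-point log2/pow for integer doubling. The host-side sketch below lists the values power_two takes, to make the stride sequence easy to inspect. Both function names are hypothetical, and the second function is an illustrative integer-only variant rather than something taken from the changelog: it starts at 2, which is the sequence pow(2, i) produces for i = 1 .. log2(HEIGHT) when HEIGHT is a power of two.

```cpp
#include <cassert>
#include <vector>

// Values taken by power_two in the loop as recorded in the changelog:
//   for (int i = 1; i <= HEIGHT; i = i * 2) { int power_two = i; ... }
std::vector<int> strides_recorded(int height) {
    assert(height > 0);
    std::vector<int> out;
    for (int i = 1; i <= height; i = i * 2) {
        out.push_back(i);
    }
    return out;
}

// Illustrative integer-only loop that starts at 2, i.e. the values that
// pow(2, i) takes for i = 1 .. log2(height) when height is a power of two.
std::vector<int> strides_from_two(int height) {
    assert(height > 0);
    std::vector<int> out;
    for (int p = 2; p <= height; p = p * 2) {
        out.push_back(p);
    }
    return out;
}
```

   For a height of 16 the recorded loop visits 1, 2, 4, 8, 16, while the variant that starts at 2 visits 2, 4, 8, 16; the two sequences differ only in the leading 1.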
   7. gaussianElim OpenCL
   a) Add code to release the device buffers at the end of the ForwardSub() function (gaussianElim.cpp)
   b) gaussian/gaussianElim.cpp 
   Add cl_cleanup();   after free(finalVec);
  
   8. LavaMD OpenCL. Thanks to Elena Stohr from the CARP project.
   lavaMD/kernel/kernel_gpu_opencl_wrapper.c
   Add #include <string.h>
   9. Pathfinder OpenCL
   a) OpenCL.cpp
   Add #include <cstdlib>
   b) Makefile
   Change the place of -lOpenCL for better compatibility with gcc-4.6+.
  10. Streamcluster OpenCL. Thanks to Elena Stohr from the CARP project.
   In CLHelper.h, allocate the device list using the element size:
   oclHandles.devices = (cl_device_id *)malloc(sizeof(cl_device_id)*deviceListSize);
   11. Hotspot OpenCL. Thanks to Elena Stohr from the CARP project.
   hotspot.c
   Add clReleaseContext(context);
   before the main function returns.
   12. Kmeans OpenCL. Thanks to Elena Stohr from the CARP project.
   Add shutdown() in the main function to release CL resources before exiting.
    • C. CUDA
   1. CFD CUDA
   Since the cutil library is no longer in use in CUDA 5.0, cutil functions are replaced by the corresponding functions in <helper_cuda.h> and <helper_timer.h>:
   CUT_SAFE_CALL and CUDA_SAFE_CALL are replaced by checkCudaErrors, and the cut timing functions are replaced by the SDK timing functions.
   2. Backprop CUDA
   In backprop_cuda.cu, change
   #include <backprop_cuda_kernel.cu> to #include "backprop_cuda_kernel.cu"
   3. BFS CUDA
   In bfs.cu
   #include <kernel.cu>
   #include <kernel2.cu>
   to 
   #include "kernel.cu"
   #include "kernel2.cu"
   4. Kmeans CUDA. Thanks to Mona Jalal from the University of Wisconsin.
   Add -lm to the link command.
   5. nn CUDA. Thanks to Mona Jalal from the University of Wisconsin.
   Fix Makefile bugs.
   6. Mummergpu CUDA. Thanks to Mona Jalal from the University of Wisconsin.
   a) Add #include <stdint.h> to:
   mummergpu_gold.cpp
   mummergpu_main.cpp
   suffix-tree.cpp
   b) mummergpu.cu:  
   Change
   void boardMemory(unsigned int * free_mem, unsigned int * total_mem)
   to
   void boardMemory(size_t * free_mem, size_t * total_mem)
   Also update the argument types at every call to the boardMemory function.
   c) Rename the getRef function to getRefGold in mummergpu_gold.cpp to avoid a multiple-definition conflict with the definition in mummergpu_kernel.cu.
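   The boardMemory signature change in item 6b presumably tracks the CUDA runtime, where cudaMemGetInfo takes size_t pointers. The sketch below shows the effect without requiring CUDA: fake_mem_get_info is a hypothetical stand-in for the runtime query, the 6 GiB board is made up for illustration, and a 64-bit size_t is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the runtime query (the real code would call
// cudaMemGetInfo). It reports a made-up board with 6 GiB total, 5 GiB free.
void fake_mem_get_info(std::size_t* free_mem, std::size_t* total_mem) {
    assert(free_mem != nullptr && total_mem != nullptr);
    const std::uint64_t gib = 1024ull * 1024ull * 1024ull;
    *total_mem = static_cast<std::size_t>(6 * gib);
    *free_mem = static_cast<std::size_t>(5 * gib);
}

// Same parameter types as the updated boardMemory signature.
void boardMemory(std::size_t* free_mem, std::size_t* total_mem) {
    fake_mem_get_info(free_mem, total_mem);
}

// What a 32-bit unsigned would keep of a byte count (wraps modulo 2^32).
std::uint32_t truncate_to_32_bits(std::size_t bytes) {
    return static_cast<std::uint32_t>(bytes);
}
```

   Under these assumptions a 32-bit unsigned int would report the 6 GiB board as 2 GiB, which is why the callers need size_t variables as well.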
    • D. OpenMP
   1. Kmeans OpenMP
   Rename the variable max_dist to min_dist in kmeans_clustering.c (in the kmeans_openmp/ and kmeans_serial/ folders) to avoid confusion.
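   The rename makes sense because the variable tracks the smallest distance seen so far during a nearest-centre search. A minimal sketch of such a search, assuming squared Euclidean distance; squared_distance and find_nearest are hypothetical names and not the functions in kmeans_clustering.c:

```cpp
#include <cassert>
#include <cfloat>
#include <cstddef>
#include <vector>

// Squared Euclidean distance between two points of equal dimension.
float squared_distance(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Index of the centre closest to `point`.
std::size_t find_nearest(const std::vector<float>& point,
                         const std::vector<std::vector<float> >& centres) {
    assert(!centres.empty());
    std::size_t index = 0;
    float min_dist = FLT_MAX;  // smallest distance seen so far
    for (std::size_t i = 0; i < centres.size(); ++i) {
        const float dist = squared_distance(point, centres[i]);
        if (dist < min_dist) {
            min_dist = dist;
            index = i;
        }
    }
    return index;
}
```

   The variable starts at the largest float value and only ever decreases, so naming it max_dist would suggest the opposite of what it holds.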