From Rodinia
Revision as of 23:58, 13 May 2015 by Kw5na (Talk | contribs) (Version History:)


  • Package Structure
       rodinia_2.3/bin : binary executables
       rodinia_2.3/common : common configuration file
       rodinia_2.3/cuda : source code for the CUDA implementations
       rodinia_2.3/data : input files
       rodinia_2.3/openmp : source code for the OpenMP implementations
       rodinia_2.3/opencl : source code for the OpenCL implementations
  • Build Rodinia
       1. Install the CUDA/OpenCL drivers, SDK and toolkit on your machine
       2. Modify the rodinia_2.3/common/make.config file 
       3. Run 'make' to build the programs (Makefiles are also provided for each benchmark)
  • Run Rodinia
       There is a 'run' file specifying the sample command to run each program.

Version History:

  • July 23, 2014: Rodinia 3.0 is released
    • 1. Bug fixes

1). OpenMP version BFS (Thanks Erik Alnervik from Linköping University in Sweden)

   Deleted the testing code at line 88 of bfs.cpp, which overwrote the source node read from the file:
   86 //read the source node from the file
   87 fscanf(fp,"%d",&source);
   88    source=0;

2). CUDA version NW (Thanks Bradley M. Small from Harmonia Holdings Group)

   Removed the unused array matrix_cuda_out.

3). Pathfinder OpenCL version (Thanks Michael Boyer and Shuai from AMD)

   Fixed a problem reading the kernel source file.

4). Leukocyte OpenCL version (Thanks Michael Boyer from AMD, thanks Alexander Grund and Guido Juckeland from Dresden)

   Fixed many bugs and made related improvements:
   - Memory operations
   - Faster kernel (replaced some div and mod operations)
   - Kernel compiled only once (reduces unrelated overhead for better results)
   - Incorporated the 0/0 fix

5). NN OpenMP version (Thanks Shuai Che from AMD)

   Fixed the makefile.

6). particlefilter OpenMP, CUDA and OpenCL versions (Thanks Shuai Che from AMD)

   a) Made PI a hardcoded constant instead of computing it with acos(-1)
   b) Fixed the makefile

7). CFD OpenCL (Thanks Alexander Grund and Guido Juckeland from Dresden)

   Fixed a few possible out-of-bounds errors in the kernel.
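
The particlefilter change in item 6) above swaps a runtime acos(-1) call for a literal constant. The following is a minimal sketch of that idea, not the benchmark's actual code: the constant value is the one quoted later in these notes for ex_particle_OCL_naive_seq.cpp, and pi_definition_gap() is a hypothetical helper added only to compare the two definitions.

```cpp
#include <cassert>
#include <cmath>

// Hardcoded constant in place of a runtime acos(-1) call.
// The value is the one quoted in these release notes.
#define PI 3.1415926535897932

// Hypothetical helper (not part of Rodinia): absolute difference between
// the hardcoded constant and the value computed at run time.
inline double pi_definition_gap() {
    return std::fabs(PI - std::acos(-1.0));
}
```

In double precision the two definitions should agree to within rounding, so the change is expected to affect portability and build behaviour rather than numerical results.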
    • 2. Multiple inputs:
   Input generator for MUMmerGPU
    • 3. New feature - configurable thread-group size, also called adjustable kernel dimension (OpenCL & CUDA versions; thanks Alexander Grund and Guido Juckeland from Dresden and Wenhao Jia from Princeton)
   The size of a thread group for a kernel can be configured with a set of compilation options.
   Affected benchmarks: kmeans, lavamd, cfd, nw, hotspot, lud, gaussian, srad, heartwall, bplustree
    • 4. New feature - Output for verification (OpenCL, CUDA and OpenMP versions)
   LavaMD, Heartwall and Leukocyte
    • 5. New benchmarks:
  1) Hybridsort
  2) DWT2D
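
The configurable thread-group size from item 3 above is set through compilation options. A common way to do this is a macro with a default value that the build can override; the sketch below assumes that approach. The macro name BLOCK_SIZE, its default of 256 and the helper num_groups() are all hypothetical, and the real option names should be taken from each benchmark's Makefile.

```cpp
#include <cassert>

// Assumed macro name. A build could override the default with a compiler
// option such as -DBLOCK_SIZE=128.
#ifndef BLOCK_SIZE
#define BLOCK_SIZE 256
#endif

static_assert(BLOCK_SIZE > 0, "thread-group size must be positive");

// Hypothetical helper: number of thread groups needed to cover n work
// items, using ceiling division so that a partial last group is counted.
inline int num_groups(int n) {
    assert(n >= 0);
    return (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
}
```

With this pattern the same source builds for different group sizes without edits, which is convenient when sweeping sizes across devices.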

  • Aug. 12, 2013: Rodinia 2.4 is released
    • 1. Leukocyte OpenCL

Repartitioned the work groups of GICOV_kernel and made the work-group size configurable.

    • 2. Kmeans OpenCL

Repartitioned the work groups of all kernels and made their work-group sizes configurable.

    • 3. Input generator for NN
    • 4. Kmeans CUDA

Changed #include <> to #include "".

    • 5. Input generator for BFS
    • 6. Input generator for Kmeans
    • 7. Input generator for Hotspot
    • 8. Corrected LavaMD rv_cpu array initialization in the OpenMP, OpenCL and CUDA versions.
    • 9. Fixed many bugs in the myocyte OpenMP version.
    • 10. Leukocyte

Corrected the cudaMemcpyToSymbol function call for CUDA 5.0.

    • 11. B+tree: reduced the group size to make it work on AMD GPUs and APUs.
    • 12. Added a built-in input generator to LUD.
    • 13. Added a built-in input generator to Gaussian elimination.
    • 14. Added __global__ to the declarations of Fan1 and Fan2 in Gaussian elimination to avoid a compile error in CUDA 5.0.
   Added -lOpenCL to the OPENCL_LIB definition in common/make.config:
   OPENCL_LIB = $(OPENCL_DIR)/OpenCL/common/lib -lOpenCL (gcc-4.6+ compatible) 
    • B. OpenCL
   1. Particlefilter OpenCL
   a) Runtime work-group size selection based on device limits
   b) Fixed several kernel bugs
   c) Initialized all arrays on both the host side and the device side
   d) Fixed out-of-bounds access to the objxy_GPU array on the device:
   objxy_GPU = clCreateBuffer(context, CL_MEM_READ_WRITE, 2*sizeof (int) *countOnes, NULL, &err);
   err = clEnqueueWriteBuffer(cmd_queue, objxy_GPU, 1, 0, 2*sizeof (int) *countOnes, objxy, 0, 0, 0);
   e) #define PI  3.1415926535897932  in ex_particle_OCL_naive_seq.cpp
   f) Placed -lOpenCL directly after -L$(OPENCL_LIB) in the Makefile.
   g) Deleted a useless function tex1Dfetch() from
   h) Added a single-precision version.
   2. B+Tree OpenCL
   a) Replaced the CUDA function __syncthreads() with the OpenCL barrier(CLK_LOCAL_MEM_FENCE) in the kernel file
   3. Heartwall OpenCL
   a) Lowered the work-group size from 512 to 256 work items (better compatibility with AMD GPUs)
   b) Fixed several bugs in the kernel code
   c) Fixed several bugs in the host code
   4. BFS OpenCL
   a) Replaced all bool with char, since bool is not a valid type for OpenCL kernel arguments.
   b) Placed -lOpenCL directly after -L$(OPENCL_LIB) in the Makefile (gcc-4.6+ compatible).
   c) Removed NVIDIA-specific parameters and decreased the thread block size for better compatibility with AMD GPUs:
   //std::string options= "-cl-nv-verbose"; // doesn't work on AMD machines
   resultCL = clBuildProgram(oclHandles.program, deviceListSize, oclHandles.devices, NULL, NULL, NULL);
   #define MAX_THREADS_PER_BLOCK 256 // 512 is too big for my AMD Fusion GPU
   d) Corrected bad mallocs:
   oclHandles.devices = (cl_device_id *)malloc(deviceListSize * sizeof(cl_device_id));
   d_mem = clCreateBuffer(oclHandles.context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, size, h_mem_ptr, &oclHandles.cl_status);
   d_mem = clCreateBuffer(oclHandles.context, CL_MEM_WRITE_ONLY | CL_MEM_COPY_HOST_PTR, size, h_mem_ptr, &oclHandles.cl_status);
   h_mem_pinned = (cl_float *)clEnqueueMapBuffer(oclHandles.queue, d_mem_pinned, CL_TRUE, \
    CL_MAP_WRITE, 0, size, 0, NULL, \
   d_graph_mask = _clMallocRW(no_of_nodes*sizeof(bool), h_graph_mask);
   d_updating_graph_mask = _clMallocRW(no_of_nodes*sizeof(bool), h_updating_graph_mask);
   d_graph_visited = _clMallocRW(no_of_nodes*sizeof(bool), h_graph_visited);
   compare_results<int>(h_cost_ref, h_cost, no_of_nodes);
   f) Added #include <cstdlib> in bfs.cpp
   g) Included time.h conditionally
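
The bool-to-char change in item a) above can be illustrated on the host side without any OpenCL calls. This is a sketch under assumptions, not the BFS source: mask_t, make_mask() and mask_bytes() are hypothetical names. The point is that a char element has a fixed size of one byte, whereas the size of bool is implementation-defined in C++ and std::vector<bool> is bit-packed, so neither gives a layout a device buffer can rely on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mask element stored as char rather than bool, so every element occupies
// exactly one byte on the host.
typedef char mask_t;

// Hypothetical helper: a mask over n nodes with only `source` set.
inline std::vector<mask_t> make_mask(std::size_t n, std::size_t source) {
    assert(source < n);
    std::vector<mask_t> mask(n, 0);
    mask[source] = 1;
    return mask;
}

// Byte count that would be passed to a buffer-allocation call for n mask
// elements.
inline std::size_t mask_bytes(std::size_t n) {
    return n * sizeof(mask_t);
}
```

A std::vector<char> also exposes contiguous storage, so its data pointer can be handed to a host-to-device copy, which is not possible with std::vector<bool>.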
   5. CFD OpenCL
   a) Commented out two unnecessary clWaitForEvents calls in CLHelper.h, which gives a 1.5x speedup on some GPUs.
   b) Placed -lOpenCL directly after -L$(OPENCL_LIB) in the Makefile (gcc-4.6+ compatible).
   c) In cfd/CLHelper.h:
   oclHandles.devices = (cl_device_id *)malloc(sizeof(cl_device_id) * deviceListSize);
   6. Backprop OpenCL. 
   a) OpenCL doesn't support integer log2 and pow, so lines 40 and 42 were changed to:
   for ( int i = 1 ; i <= HEIGHT ; i=i*2){
   int power_two = i;
   b) Changed if( device_list ) delete device_list; to
   if( device_list ) delete[] device_list; 
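
The loop in item a) above steps through powers of two by integer doubling instead of calling log2 and pow. The sketch below isolates that pattern in plain C++ so the visited values can be inspected; powers_of_two_up_to() is a hypothetical helper, and the upper bound on height is an assumption added to keep the doubling free of signed overflow.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: the successive power_two values visited by the
// doubling loop, computed with integer arithmetic only.
inline std::vector<int> powers_of_two_up_to(int height) {
    assert(height >= 1);
    assert(height < (1 << 30));  // assumed bound so that i * 2 cannot overflow
    std::vector<int> out;
    for (int i = 1; i <= height; i = i * 2) {
        int power_two = i;       // same role as in the changed kernel lines
        out.push_back(power_two);
    }
    return out;
}
```

For a height of 16 the loop visits 1, 2, 4, 8 and 16; for a height that is not a power of two it stops at the largest power of two below it.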
   7. gaussianElim OpenCL
   a) Added code to release the device buffers at the end of the ForwardSub() function (gaussianElim.cpp)
   b) In gaussian/gaussianElim.cpp,
   added cl_cleanup();   after free(finalVec);

   8. LavaMD OpenCL: In lavaMD/kernel/kernel_gpu_opencl_wrapper.c
   added #include <string.h>
   9. pathfinder OpenCL
   a) OpenCL.cpp: added #include <cstdlib>
   b) Makefile: changed the place of -lOpenCL for better compatibility with gcc-4.6+.
   10. streamcluster OpenCL: In CLHelper.h
   oclHandles.devices = (cl_device_id *)malloc(sizeof(cl_device_id)*deviceListSize);
   11. Hotspot OpenCL: In hotspot.c, added clReleaseContext(context);
   before the main function returns.
   12. kmeans OpenCL: Added shutdown() to the main function to release CL resources before exit.
    • C. CUDA
   1. CFD CUDA: Solved a compatibility problem with CUDA 5.0.
   2. Backprop CUDA: Corrected the include command in
   3. BFS CUDA: Corrected the include command in
   4. kmeans CUDA: Added "-lm" to the link command.
   5. nn CUDA: Fixed makefile bugs
   6. mummergpu CUDA 
   a) Added #include <stdint.h>  to
   b) Corrected the parameter types of the void boardMemory function.
   c) Renamed the getRef function to getRefGold in mummergpu_gold.cpp to avoid a multiple definition
    • D. OpenMP
   1. Kmeans OpenMP
   Renamed the variable max_dist to min_dist in kmeans_clustering.c in the kmeans_openmp/ and kmeans_serial/ folders to avoid confusion.
   For bug reports and fixes:
   Thanks Alexey Kravets, Georgia Kouveli and Elena Stohr from CARP project. Thanks Maxim Perminov from Intel. Thanks Daniel Lustig from Princeton. 
   Thanks John Andrew Stratton from UIUC. Thanks Mona Jalal from University of Wisconsin.

  • Oct. 09, 2012: Rodinia 2.2 is released
   - BFS: Deleted the invalid flag CL_MEM_USE_HOST_PTR from the _clMallocRW and _clMalloc functions in the OpenCL version.
      Thanks Alexey Kravets (CARP European research project).
   - Hotspot: Corrected the index calculation to grid_cols *loadYidx + loadXidx. Corrected the same problem in 
      Thanks Alexey Kravets.
   - Pathfinder: Added two __syncthreads calls in the dynproc_kernel function of the CUDA version to avoid a data race.
      Thanks Ronny Krashinsky (NVIDIA) and Jiayuan Meng (Argonne National Laboratory).
      Alexey Kravets found and corrected the same problem in the OpenCL version.
   - SRAD: Replaced the CUDA function __syncthreads() in the srad OpenCL kernel with the OpenCL barrier(CLK_LOCAL_MEM_FENCE).
   - NN: Fixed a bug in the CUDA version that occurred on certain input sizes.
      The new version detects when the x-dimension size limit of a CUDA block grid is exceeded and launches a two-dimensional grid if needed.
      (Only the CUDA version has this problem.)
   - Promoted B+Tree to the main distribution (with output)
   - Promoted Myocyte to the main distribution (with output)
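
The NN fix above falls back to a two-dimensional grid when the number of blocks exceeds the x-dimension limit. The following is a minimal host-side sketch of one way to pick such dimensions, not the NN benchmark's actual code. GridDim and choose_grid() are hypothetical names, and the limit is passed in as a parameter; 65535 is used in the examples as an assumed limit for older CUDA devices.

```cpp
#include <cassert>

// Hypothetical stand-in for the x and y extents of a CUDA grid.
struct GridDim {
    unsigned long x;
    unsigned long y;
};

// Hypothetical helper: use a 1D grid when `blocks` fits within max_grid_x,
// otherwise spread the blocks over two dimensions.
inline GridDim choose_grid(unsigned long blocks, unsigned long max_grid_x) {
    assert(blocks > 0 && max_grid_x > 0);
    GridDim g;
    if (blocks <= max_grid_x) {
        g.x = blocks;
        g.y = 1;
    } else {
        // Fewest rows that keep each row within the x limit.
        g.y = (blocks + max_grid_x - 1) / max_grid_x;
        // Balance the blocks across those rows; x * y >= blocks.
        g.x = (blocks + g.y - 1) / g.y;
    }
    return g;
}
```

Because x * y can exceed the requested block count, a kernel launched with such a grid needs to compute a linear block index and skip any index past the end of the data.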
  • June 27, 2012: Rodinia 2.1 is released
   - Includes fixes for SRAD, Heartwall, Particle Filter and Streamcluster
  • Nov 23, 2011: Rodinia 2.0.1 is released
   - Includes a CUDA version of NN comparable to the OCL version.
   - Uses a new version of clutils that is BSD-licensed, not GPL.
  • Nov 11, 2011: Rodinia 2.0 is released
   - Added several applications to the main suite:
      lavaMD, Gaussian Elimination, Pathfinder, k-Nearest Neighbor and Particle Filter. 
      Detailed application information can also be found at
   - Merged new OpenCL implementations into the main tarball.
  • Mar 01, 2010: Rodinia 1.0 is released