The code calculates particle potential and relocation due to mutual forces between particles within a large 3D space. This space is divided into cubes, or large boxes, that are allocated to individual cluster nodes. The large box at each node is further divided into cubes, called boxes. 26 neighbor boxes surround each box (the home box). Home boxes at the boundaries of the particle space have fewer neighbors. Particles only interact with those other particles that are within a cutoff radius since ones at larger distances exert negligible forces. Thus the box size s chosen so that cutoff radius does not span beyond any neighbor box for any particle in a home box, thus limiting the reference space to a finite number of boxes.
This code  was derived from the ddcMD application  by rewriting the front end and structuring it for parallelization. This code represents MPI task that runs on a single cluster node. While the details of the code are somewhat different than the original, the code retains the structure of the MPI task in the original code. Since the rest of MPI code is not included here, the application first emulates MPI partitioning of the particle space into boxes. Then, for every particle in the home box, the nested loop processes interactions first with other particles in the home box and then with particles in all neighbor boxes. The processing of each particle consists of a single stage of calculation that is enclosed in the innermost loop. The nested loops in the application were parallelized in such a way that at any point of time GPU warp/wavefront accesses adjacent memory locations. The speedup depends on the number of boxes, particles (fixed) and the actualcal culation for each particle (fixed). The application is memory bound, and GPU speedup seems to saturate at about 16x when compared to single-core CPU.
 L. G. Szafaryn, T. Gamblin, B. deSupinski and K. Skadron. "Experiences with Achieving Portability across Heterogeneous Architectures." Submitted to WOLFHPC workshop at 25th International Conference on Supercomputing (ICS). Tucson, AZ. 2010. (pdf)
 F. H. Streitz, J. N. Glosli, M. V. Patel, B. Chan, R. K. Yates, B. R. de Supinski, J. Sexton, J and A. Gunnels. "100+ TFlop Solidification Simulations on BlueGene/L." In Proceedings of the 2005 Supercomputing Conference (SC 05). Seattle, WA. 2005. (pdf)
 L. G. Szafaryn, T. Gamblin, B. deSupinski and K. Skadron. "Experiences with Achieving Portability across Heterogeneous Architectures." Submitted to WOLFHPC workshop at 25th International Conference on Supercomputing (ICS). Tucson, AZ. 2010. (ppt)
OpenMP Version: (tar.gz)
CUDA Version: (tar.gz)
OpenCL Version: (tar.gz)