LavaMD

From Rodinia

The code calculates particle potential and relocation due to mutual forces between particles within a large 3D space. This space is divided into cubes, or large boxes, that are allocated to individual cluster nodes. The large box at each node is further divided into cubes, called boxes. Each box (the home box) is surrounded by 26 neighbor boxes; home boxes at the boundaries of the particle space have fewer neighbors. Particles only interact with other particles that lie within a cutoff radius, since particles at larger distances exert negligible forces. The box size is therefore chosen so that the cutoff radius does not extend beyond any neighbor box for any particle in a home box, which limits the reference space to a finite number of boxes.
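
The decomposition into a home box and its up-to-26 neighbor boxes can be sketched as follows. This is a minimal illustration only, not the actual Rodinia LavaMD source; the names (box_t, setup_neighbors, NBOXES_1D) and the fixed box count are assumptions made for clarity.

/* Minimal sketch of the spatial decomposition described above. All names
   (box_t, setup_neighbors, NBOXES_1D) are illustrative, not identifiers
   from the actual LavaMD source. */

#define NBOXES_1D 4                                 /* boxes per dimension in this node's large box */
#define NBOXES    (NBOXES_1D * NBOXES_1D * NBOXES_1D)

typedef struct {
    int x, y, z;                                    /* box coordinates inside the large box */
    int nn;                                         /* number of valid neighbors (at most 26) */
    int neighbor[26];                               /* flat indices of the surrounding boxes */
} box_t;

/* Enumerate the up-to-26 neighbors of every box; boxes on the boundary of
   the particle space simply end up with fewer neighbors. */
static void setup_neighbors(box_t boxes[NBOXES])
{
    for (int z = 0; z < NBOXES_1D; z++)
    for (int y = 0; y < NBOXES_1D; y++)
    for (int x = 0; x < NBOXES_1D; x++) {
        box_t *b = &boxes[(z * NBOXES_1D + y) * NBOXES_1D + x];
        b->x = x; b->y = y; b->z = z; b->nn = 0;
        for (int dz = -1; dz <= 1; dz++)
        for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            if (dx == 0 && dy == 0 && dz == 0)
                continue;                           /* skip the home box itself */
            int nx = x + dx, ny = y + dy, nz = z + dz;
            if (nx < 0 || nx >= NBOXES_1D ||
                ny < 0 || ny >= NBOXES_1D ||
                nz < 0 || nz >= NBOXES_1D)
                continue;                           /* boundary box: fewer than 26 neighbors */
            b->neighbor[b->nn++] = (nz * NBOXES_1D + ny) * NBOXES_1D + nx;
        }
    }
}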

This code [1] was derived from the ddcMD application [2] by rewriting the front end and structuring it for parallelization. The code represents an MPI task that runs on a single cluster node. While its details differ somewhat from the original, the code retains the structure of the MPI task in the original code. Since the rest of the MPI code is not included here, the application first emulates the MPI partitioning of the particle space into boxes. Then, for every particle in the home box, a nested loop processes interactions first with the other particles in the home box and then with the particles in all neighbor boxes. The processing of each particle consists of a single stage of calculation enclosed in the innermost loop.
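
Under the same caveat, the nested interaction loop described above might look roughly like the sketch below; par_t, interact, compute_box, PARS_PER_BOX, and the toy pairwise potential are all hypothetical stand-ins for the structures and formula used by the real code.

/* Illustrative sketch of the nested loop described above: for each particle
   in the home box, process the home box first, then every neighbor box.
   par_t, interact, compute_box and the toy potential are hypothetical. */
#include <math.h>

#define PARS_PER_BOX 100
#define CUTOFF2      (10.0 * 10.0)                  /* squared cutoff radius */

typedef struct { int nn; int neighbor[26]; } box_t; /* simplified box_t from the sketch above */
typedef struct { double x, y, z, q, pot; } par_t;   /* position, charge, accumulated potential */

/* Accumulate the contribution of particle j on particle i, skipping the
   self pair and any pair beyond the cutoff radius. */
static void interact(par_t *pi, const par_t *pj)
{
    double dx = pi->x - pj->x, dy = pi->y - pj->y, dz = pi->z - pj->z;
    double r2 = dx * dx + dy * dy + dz * dz;
    if (r2 == 0.0 || r2 > CUTOFF2)
        return;
    pi->pot += pi->q * pj->q / sqrt(r2);            /* toy pairwise potential */
}

/* Single stage of calculation for one home box; the actual pair work sits
   in the innermost loop. */
static void compute_box(const box_t *home, par_t par[][PARS_PER_BOX], int home_idx)
{
    for (int i = 0; i < PARS_PER_BOX; i++) {
        /* interactions with the other particles in the home box */
        for (int j = 0; j < PARS_PER_BOX; j++)
            interact(&par[home_idx][i], &par[home_idx][j]);

        /* interactions with particles in all (up to 26) neighbor boxes */
        for (int k = 0; k < home->nn; k++)
            for (int j = 0; j < PARS_PER_BOX; j++)
                interact(&par[home_idx][i], &par[home->neighbor[k]][j]);
    }
}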

More information about the parallel version of this code can be found in:

[1] L. G. Szafaryn, T. Gamblin, B. R. de Supinski, and K. Skadron. "Experiences with Achieving Portability across Heterogeneous Architectures." Submitted to the WOLFHPC workshop at the 25th International Conference on Supercomputing (ICS). Tucson, AZ. 2010.

More about the original ddcMD application can be found in:

[2] F. H. Streitz, J. N. Glosli, M. V. Patel, B. Chan, R. K. Yates, B. R. de Supinski, J. Sexton, and J. A. Gunnels. "100+ TFlop Solidification Simulations on BlueGene/L." In Proceedings of the 2005 Supercomputing Conference (SC 05). Seattle, WA. 2005.

OpenMP Version:
Code (tar.gz)
Input (tar.gz)

CUDA Version:
Code (tar.gz)
Input (tar.gz)