The Heart Wall application tracks the movement of a mouse heart over a sequence of 104 ultrasound images, each 609x590 pixels, to record its response to a stimulus. In its initial stage, the program performs image processing operations on the first image to detect initial, partial shapes of the inner and outer heart walls. These operations include edge detection, SRAD despeckling (also part of the Rodinia suite), morphological transformation, and dilation. To reconstruct approximate full shapes of the heart walls, the program generates ellipses that are superimposed over the image and sampled to mark points on the heart walls (Hough Search). In its final stage (Heart Wall Tracking, presented here), the program tracks the movement of the surfaces by detecting the movement of image areas under the sample points as the shapes of the heart walls change throughout the sequence of images.
SRAD (Speckle Reducing Anisotropic Diffusion) is a diffusion method for ultrasonic and radar imaging applications based on partial differential equations (PDEs). It is used to remove locally correlated noise, known as speckles, without destroying important image features.
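To make the diffusion idea concrete, the following is a minimal C sketch of the per-pixel diffusion coefficient c(q) as formulated in the Yu and Acton article cited in this section. The four-neighbour finite differences, the clamping to [0, 1], and all names (`srad_coeff`, `q0sqr`, and so on) are assumptions made for illustration; this is not presented as the released implementation.

```c
#include <assert.h>

/* Diffusion coefficient c(q) for one pixel (illustrative sketch).
 * jc is the centre pixel; n, s, w, e are its four neighbours.
 * q0sqr is the squared speckle scale, assumed here to be estimated
 * elsewhere as variance / mean^2 over a homogeneous image region. */
float srad_coeff(float jc, float n, float s, float w, float e, float q0sqr)
{
    assert(jc != 0.0f && q0sqr > 0.0f);

    /* Differences between the centre pixel and its four neighbours. */
    float dn = n - jc, ds = s - jc, dw = w - jc, de = e - jc;

    /* Normalised squared gradient magnitude and normalised Laplacian. */
    float g2 = (dn * dn + ds * ds + dw * dw + de * de) / (jc * jc);
    float l  = (dn + ds + dw + de) / jc;

    /* Squared instantaneous coefficient of variation q^2. */
    float num  = 0.5f * g2 - (1.0f / 16.0f) * l * l;
    float den  = 1.0f + 0.25f * l;
    float qsqr = num / (den * den);

    /* c(q) = 1 / (1 + (q^2 - q0^2) / (q0^2 * (1 + q0^2))). */
    float c = 1.0f / (1.0f + (qsqr - q0sqr) / (q0sqr * (1.0f + q0sqr)));

    /* Clamp to [0, 1]: 1 means full smoothing, 0 means no diffusion. */
    if (c < 0.0f) c = 0.0f;
    if (c > 1.0f) c = 1.0f;
    return c;
}
```

In this sketch a homogeneous neighbourhood yields a coefficient of 1 (speckle is smoothed), while a strong edge drives the coefficient toward 0, so the edge is left largely untouched.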
Our CUDA implementation is based on the MATLAB code provided by Prof. Scott Acton's group in the U.Va Department of Electrical Engineering. The typical inputs to the program are ultrasound images, with each point representing a pixel in the image. Currently, the computation grid in our released CUDA version is filled with random floating-point numbers. The details of the algorithm are provided in the article:
Y. Yu, S. Acton. "Speckle reducing anisotropic diffusion." IEEE Transactions on Image Processing 11(11) (2002) 1260-1270.
A discussion of SRAD in the context of the Heart Wall application is available in: L.G. Szafaryn, K. Skadron and J. Saucerman. "Experiences Accelerating MATLAB Systems Biology Applications." in Workshop on Biomedicine in Computing (BiC) at the International Symposium on Computer Architecture (ISCA), June 2009. http://www.cs.virginia.edu/~lgs9a/publications/isca_bic_09.pdf
The tracking stage presented here is the final stage of the Heart Wall application. The parallel computation that accounts for almost all of the execution time (the exact share depends on the number of ultrasound images) takes place in this stage. We therefore present it separately from the other stages of the application to allow easy analysis of this particular type of workload.
The tracking stage takes the positions of the heart walls in the first ultrasound image of the sequence, as determined by the initial detection stage of the application. The tracking process is implemented as multiple nested loops that process batches of 10 frames and 51 points in each image. Displacement of the heart walls is detected by comparing the currently processed frame to a template frame, which is updated after each batch of frames has been processed. There is a sequential dependency between processed frames. The processing of each point consists of a large number of small serial steps with interleaved control statements. Each step involves a small amount of computation performed on only a subset of the entire image.
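The loop structure described above can be sketched in C as follows. This is a simplified illustration only: it assumes a sum-of-squared-differences match within a small search window, and the template size `TPL`, the search radius `SRCH`, the `Point` type, and the function names are all hypothetical. The actual application performs many more steps per point.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

#define BATCH 10  /* frames per batch, as described in the text */
#define TPL   3   /* template side length in pixels (hypothetical, odd) */
#define SRCH  2   /* search radius in pixels (hypothetical) */

typedef struct { int r, c; } Point;

/* Sum of squared differences between a template and the TPL x TPL patch
 * of img (row width w) centred at (r, c). */
static float ssd(const float *img, int w, int r, int c, const float *tpl)
{
    float sum = 0.0f;
    for (int i = 0; i < TPL; i++)
        for (int j = 0; j < TPL; j++) {
            float d = img[(r - TPL / 2 + i) * w + (c - TPL / 2 + j)]
                    - tpl[i * TPL + j];
            sum += d * d;
        }
    return sum;
}

/* Copy the patch centred at p into tpl; p must lie inside the border. */
static void grab(const float *img, int h, int w, Point p, float *tpl)
{
    assert(p.r >= TPL / 2 && p.c >= TPL / 2);
    assert(p.r < h - TPL / 2 && p.c < w - TPL / 2);
    for (int i = 0; i < TPL; i++)
        for (int j = 0; j < TPL; j++)
            tpl[i * TPL + j] = img[(p.r - TPL / 2 + i) * w + (p.c - TPL / 2 + j)];
}

/* Track npts points through nframes frames of h x w pixels stored back to
 * back in frames. pts holds the frame-0 positions on entry and the final
 * positions on return. tpls is scratch space for npts * TPL * TPL floats. */
void track(const float *frames, int nframes, int h, int w,
           Point *pts, int npts, float *tpls)
{
    for (int p = 0; p < npts; p++)
        grab(frames, h, w, pts[p], &tpls[p * TPL * TPL]);

    for (int f0 = 1; f0 < nframes; f0 += BATCH) {        /* batches */
        int f1 = (f0 + BATCH < nframes) ? f0 + BATCH : nframes;
        for (int f = f0; f < f1; f++) {                  /* frames: sequential */
            const float *img = frames + (size_t)f * h * w;
            for (int p = 0; p < npts; p++) {             /* points in one frame */
                Point best = pts[p];
                float best_d = FLT_MAX;
                for (int dr = -SRCH; dr <= SRCH; dr++)
                    for (int dc = -SRCH; dc <= SRCH; dc++) {
                        int r = pts[p].r + dr, c = pts[p].c + dc;
                        if (r < TPL / 2 || c < TPL / 2 ||
                            r >= h - TPL / 2 || c >= w - TPL / 2)
                            continue;                    /* patch leaves image */
                        float d = ssd(img, w, r, c, &tpls[p * TPL * TPL]);
                        if (d < best_d) { best_d = d; best.r = r; best.c = c; }
                    }
                pts[p] = best;
            }
        }
        /* Update the templates from the last frame of the batch. */
        const float *last = frames + (size_t)(f1 - 1) * h * w;
        for (int p = 0; p < npts; p++)
            grab(last, h, w, pts[p], &tpls[p * TPL * TPL]);
    }
}
```

In this sketch each frame depends on the point positions found in the previous one, which mirrors the sequential dependency between frames, while the work done for each point within a frame touches only a small patch of the image.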
The original code was written in MATLAB (interpreted, with some functions compiled), with performance approaching that of a pure C version. A multi-threaded OpenMP version of the code running on a quad-core processor achieves over 4x speedup compared to the single-threaded C version. Partitioning the working set between caches and avoiding cache thrashing contribute to this performance. When running the CUDA version, the GPU hardware is underutilized because of the limited amount of computation at each step. The GPU overhead (data transfer and kernel launch) is also significant. The large size of the processed images and the lack of temporal locality did not allow the use of fast shared memory. To provide a better speedup (15x), more drastic GPU optimization techniques were required; these sacrificed modularity in order to fit the code into a single kernel call, and they combined unrelated functions and data transfers in single kernels.
To run the application, specify the number of frames to be processed as a parameter on the command line. This number cannot be larger than the number of frames in the video file that the application uses (104 frames in this case).
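The constraint on the frame count can be checked before any processing starts. The following C sketch shows one way to validate such an argument; the function name `parse_frame_count`, the constant `MAX_FRAMES`, and the lower bound of 1 are assumptions for illustration and are not taken from the released code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define MAX_FRAMES 104  /* frames in the input video, per the text */

/* Parse the frame-count argument. Returns the count (1..MAX_FRAMES),
 * or -1 if arg is not a whole decimal number in that range. */
int parse_frame_count(const char *arg)
{
    assert(arg != NULL);

    char *end;
    errno = 0;
    long n = strtol(arg, &end, 10);

    /* Reject overflow, empty input, and trailing non-digit characters. */
    if (errno != 0 || end == arg || *end != '\0')
        return -1;

    /* Reject counts outside the range supported by the video file. */
    if (n < 1 || n > MAX_FRAMES)
        return -1;

    return (int)n;
}
```

A caller would typically pass the first command-line argument to this function and exit with an error message when it returns -1.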
For more information, see:
Paper: L.G. Szafaryn, K. Skadron and J. Saucerman. "Experiences Accelerating MATLAB Systems Biology Applications." in Workshop on Biomedicine in Computing (BiC) at the International Symposium on Computer Architecture (ISCA), June 2009. http://www.cs.virginia.edu/~lgs9a/publications/isca_bic_09.pdf
Presentation Slides: L.G. Szafaryn, K. Skadron. "Experiences Accelerating MATLAB Systems Biology Applications - Heart Wall Tracking". http://www.cs.virginia.edu/~lgs9a/rodinia/hwt/hwt.ppt