Difference between revisions of "Heart Wall"

From Rodinia
Jump to: navigation, search
 
(12 intermediate revisions by one other user not shown)
Line 1: Line 1:
The Heart Wall application tracks the movement of a mouse heart over a sequence of 104 609x590 ultrasound images to record response to the stimulus. In its initial stage, the program performs image processing operations on the first image to detect initial, partial shapes of inner and outer heart walls. These operations include: edge detection, SRAD despeckling (also part of Rodinia suite), morphological transformation and dilation. In order to reconstruct approximated full shapes of heart walls, the program generates ellipses that are superimposed over the image and sampled to mark points on the heart walls (Hough Search). In its final stage (Heart Wall Tracking presented here), program tracks movement of surfaces by detecting the movement of image areas under sample points as the shapes of the heart walls change throughout the sequence of images.  
+
The Heart Wall application tracks the movement of a mouse heart over a sequence of 104 609x590 ultrasound images to record response to the stimulus. In its initial stage, the program performs image processing operations on the first image to detect initial, partial shapes of inner and outer heart walls. These operations include: edge detection, SRAD despeckling (also part of Rodinia suite) [2], morphological transformation and dilation. In order to reconstruct approximated full shapes of heart walls, the program generates ellipses that are superimposed over the image and sampled to mark points on the heart walls (Hough Search). In its final stage (Heart Wall Tracking presented here) [1], program tracks movement of surfaces by detecting the movement of image areas under sample points as the shapes of the heart walls change throughout the sequence of images.  
  
 
Only two stages of the application, SRAD and Tracking, contain enough data/task parallelism to be interesting for Rodinia suite, and therefore they are presented here separately. The separation of these two parts of the application allows easy analysis of the distinct types of workloads. However, in the near future, we plan to convert the entire application to OpenMP and CUDA and make all of its parts available together as well as separately in the Rodinia suite.
 
Only two stages of the application, SRAD and Tracking, contain enough data/task parallelism to be interesting for Rodinia suite, and therefore they are presented here separately. The separation of these two parts of the application allows easy analysis of the distinct types of workloads. However, in the near future, we plan to convert the entire application to OpenMP and CUDA and make all of its parts available together as well as separately in the Rodinia suite.
  
For more information, see:<br>
 
 
Papers:<br>  
 
Papers:<br>  
[1] L. G. Szafaryn, K. Skadron, and J. J. Saucerman. "Experiences Accelerating MATLAB Systems Biology Applications." In Proceedings of the Workshop on Biomedicine in Computing: Systems, Architectures, and Circuits (BiC) 2009, in conjunction with the 36th IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2009. ([http://www.cs.virginia.edu/~skadron/Papers/BiC09.pdf pdf])<br>
+
[1] L. G. Szafaryn, K. Skadron, and J. J. Saucerman. "Experiences Accelerating MATLAB Systems Biology Applications." In Proceedings of the Workshop on Biomedicine in Computing: Systems, Architectures, and Circuits (BiC) 2009, in conjunction with the 36th IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2009. ([http://www.cs.virginia.edu/~lgs9a/publications/09_isca_bic_paper.pdf pdf])<br>
[2] Y. Yu, S. Acton, Speckle reducing anisotropic diffusion, IEEE Transactions on Image Processing 11(11)(2002) 1260-1270 ([http://people.virginia.edu/~sc5nf/01097762.pdf pdf])<br>
+
[2] Y. Yu, S. Acton, Speckle reducing anisotropic diffusion, IEEE Transactions on Image Processing 11(11)(2002) 1260-1270. ([http://www.cs.virginia.edu/~lgs9a/rodinia/heartwall/srad/paper_2.pdf pdf])<br>
 +
 
 
Presentation Slides:<br>
 
Presentation Slides:<br>
[3] L. G. Szafaryn, K. Skadron. "Experiences Accelerating MATLAB Systems Biology Applications - Heart Wall Tracking". ([http://www.cs.virginia.edu/~lgs9a/rodinia/heart_wall/tracking/tracking.ppt ppt])<br>
+
[1] L. G. Szafaryn, K. Skadron. "Experiences Accelerating MATLAB Systems Biology Applications - Heart Wall Tracking". ([http://www.cs.virginia.edu/~lgs9a/rodinia/heartwall/tracking/tracking.ppt ppt])<br>
 
+
===SRAD===
+
 
+
SRAD is one of the first stages of the Heart Wall application. SRAD (Speckle Reducing Anisotropic Diffusion) is a diffusion method for ultrasonic and radar imaging applications based on partial differential equations (PDEs). It is used to remove locally correlated noise, known as speckles, without destroying important image features. SRAD consists of several pieces of work: image extraction, continuous iterations over the image (preparation, reduction, statistics, computation 1 and computation 2) and image compression. The sequential dependency between all of these stages requires synchronization after each stage (because each stage operates on the entire image).
+
 
+
Partitioning of the working set between caches and avoiding of cache trashing contribute to the performance. In CUDA version, each stage is a separate kernel (due to synchronization requirements) that operates on data already residing in GPU memory. In order to improve GPU performance data was transferred to GPU at the beginning of the code and then transferred back to CPU after all of the computation stages were completed in GPU. Some of the kernels use GPU shared memory for additional improvement in performance. Speedup achievable with CUDA version depends on the image size (up to the point where GPU saturates).
+
 
+
OpenMP Version:<br>
+
Code ([http://www.cs.virginia.edu/~lgs9a/rodinia/heart_wall/srad/hw_srad_openmp_code.tar.gz tar.gz])<br>
+
Input ([http://www.cs.virginia.edu/~lgs9a/rodinia/heart_wall/srad/hw_srad_openmp_input.tar.gz tar.gz])<br>
+
 
+
CUDA Version:<br>
+
Code ([http://www.cs.virginia.edu/~lgs9a/rodinia/heart_wall/srad/hw_srad_cuda_code.tar.gz tar.gz])<br>
+
Input ([http://www.cs.virginia.edu/~lgs9a/rodinia/heart_wall/srad/hw_srad_cuda_input.tar.gz tar.gz])<br>
+
  
 
===Tracking===
 
===Tracking===
Line 30: Line 16:
 
Partitioning of the working set between caches and avoiding of cache trashing contribute to the performance. CUDA implementation of this code is a classic example of the exploitation of braided parallelism. Processing of sample points is assigned to multiprocessors (TLP), while processing of individual pixels in each sample image is assigned to processors inside each multiprocessor. However, each GPU multiprocessor is usually underutilized because of the limited amount of computation at each computation step. Large size of processed images and lack temporal locality did not allow for utilization of fast shared memory. Also the GPU overhead (data transfer and kernel launch) are significant. In order to provide better speedup, more drastic GPU optimization techniques that sacrificed modularity (in order to include code in one kernel call) were used. These techniques also combined unrelated functions and data transfers in single kernels.
 
Partitioning of the working set between caches and avoiding of cache trashing contribute to the performance. CUDA implementation of this code is a classic example of the exploitation of braided parallelism. Processing of sample points is assigned to multiprocessors (TLP), while processing of individual pixels in each sample image is assigned to processors inside each multiprocessor. However, each GPU multiprocessor is usually underutilized because of the limited amount of computation at each computation step. Large size of processed images and lack temporal locality did not allow for utilization of fast shared memory. Also the GPU overhead (data transfer and kernel launch) are significant. In order to provide better speedup, more drastic GPU optimization techniques that sacrificed modularity (in order to include code in one kernel call) were used. These techniques also combined unrelated functions and data transfers in single kernels.
  
OpenMP Version:<br>
+
<!--
Code ([http://www.cs.virginia.edu/~lgs9a/rodinia/heart_wall/tracking/hw_tracking_openmp_code.tar.gz tar.gz])<br>
+
Input ([http://www.cs.virginia.edu/~lgs9a/rodinia/heartwall/tracking/hw_tracking_input.tar.gz tar.gz]) ([http://www.cs.virginia.edu/~lgs9a/rodinia/heartwall/tracking/hw_tracking_input_2.tar.gz tar.gz])<br>
Input ([http://www.cs.virginia.edu/~lgs9a/rodinia/heart_wall/tracking/hw_tracking_openmp_input.tar.gz tar.gz])<br>
+
 
 +
OpenMP Version: ([http://www.cs.virginia.edu/~lgs9a/rodinia/heartwall/tracking/hw_tracking_openmp_code.tar.gz tar.gz])<br>
 +
 
 +
CUDA Version: ([http://www.cs.virginia.edu/~lgs9a/rodinia/heartwall/tracking/hw_tracking_cuda_code.tar.gz tar.gz])<br>
  
CUDA Version:<br>
+
OpenCL Version: ([http://www.cs.virginia.edu/~lgs9a/rodinia/heartwall/tracking/hw_tracking_opencl_code.tar.gz tar.gz])<br>
Code ([http://www.cs.virginia.edu/~lgs9a/rodinia/heart_wall/tracking/hw_tracking_cuda_code.tar.gz tar.gz])<br>
+
-->
Input ([http://www.cs.virginia.edu/~lgs9a/rodinia/heart_wall/tracking/hw_tracking_cuda_input.tar.gz tar.gz])<br>
+

Latest revision as of 18:39, 25 June 2015

The Heart Wall application tracks the movement of a mouse heart over a sequence of 104 609x590 ultrasound images to record response to the stimulus. In its initial stage, the program performs image processing operations on the first image to detect initial, partial shapes of inner and outer heart walls. These operations include: edge detection, SRAD despeckling (also part of Rodinia suite) [2], morphological transformation and dilation. In order to reconstruct approximated full shapes of heart walls, the program generates ellipses that are superimposed over the image and sampled to mark points on the heart walls (Hough Search). In its final stage (Heart Wall Tracking presented here) [1], program tracks movement of surfaces by detecting the movement of image areas under sample points as the shapes of the heart walls change throughout the sequence of images.

Only two stages of the application, SRAD and Tracking, contain enough data/task parallelism to be interesting for Rodinia suite, and therefore they are presented here separately. The separation of these two parts of the application allows easy analysis of the distinct types of workloads. However, in the near future, we plan to convert the entire application to OpenMP and CUDA and make all of its parts available together as well as separately in the Rodinia suite.

Papers:
[1] L. G. Szafaryn, K. Skadron, and J. J. Saucerman. "Experiences Accelerating MATLAB Systems Biology Applications." In Proceedings of the Workshop on Biomedicine in Computing: Systems, Architectures, and Circuits (BiC) 2009, in conjunction with the 36th IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2009. (pdf)
[2] Y. Yu, S. Acton, Speckle reducing anisotropic diffusion, IEEE Transactions on Image Processing 11(11)(2002) 1260-1270. (pdf)

Presentation Slides:
[1] L. G. Szafaryn, K. Skadron. "Experiences Accelerating MATLAB Systems Biology Applications - Heart Wall Tracking". (ppt)

Tracking

Tracking is the final stage of the Heart Wall application. It takes the positions of heart walls from the first ultrasound image in the sequence as determined by the initial detection stage in the application. Tracking code is implemented in the form of multiple nested loops that process batches of 10 frames and 51 points in each image. Displacement of heart walls is detected by comparing currently processed frame to the template frame which is updated after processing a batch of frames. There is a sequential dependency between processed frames. The processing of each point consist of a large number of small serial steps with interleaved control statements. Each of the steps involves a small amount of computation performed only on a subset of entire image. This stage of the application accounts for almost all of the execution time (the exact ratio depends on the number of ultrasound images).

Partitioning of the working set between caches and avoiding of cache trashing contribute to the performance. CUDA implementation of this code is a classic example of the exploitation of braided parallelism. Processing of sample points is assigned to multiprocessors (TLP), while processing of individual pixels in each sample image is assigned to processors inside each multiprocessor. However, each GPU multiprocessor is usually underutilized because of the limited amount of computation at each computation step. Large size of processed images and lack temporal locality did not allow for utilization of fast shared memory. Also the GPU overhead (data transfer and kernel launch) are significant. In order to provide better speedup, more drastic GPU optimization techniques that sacrificed modularity (in order to include code in one kernel call) were used. These techniques also combined unrelated functions and data transfers in single kernels.