QG_GYRE and Memory Bandwidth

The QG_GYRE code is a benchmark version of a two-dimensional ocean model. The model solves a second-order finite-difference approximation to the vorticity equation that governs the vertically averaged, large-scale oceanic flow. From the computational point of view, the calculations are strongly dominated by unit-stride vector operations.

One major bottleneck in the code is the solution of a linear, constant-coefficient elliptic equation to recover the streamfunction from the new vorticity field at each time step. This equation is solved by Fourier transformation along the second index of the array (vectorizing over the first index), followed by solution of the resulting independent tridiagonal systems. This second step has a data dependency on the first array index. On Cray machines, the compiler automatically vectorizes on the second index (using non-unit strides). On hierarchical-memory machines, it is generally more efficient to keep the data dependence in the inner loop and use unit strides, though for small array sizes (or very deep pipelines) it is preferable to place the second index innermost and accept the resulting increase in cache misses.
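The two loop orderings for the tridiagonal step can be sketched as follows. This is only an illustration, not the QG_GYRE source: the function names, the `rhs[i][j]` layout, and the per-system diagonal list `diags` are assumptions made for the example, and the Fourier transform step is omitted. Python nested lists say nothing about memory stride, so the sketch shows only which index is innermost; in the original setting the first index `i` is the unit-stride one.

```python
def solve_recurrence_innermost(sub, diags, sup, rhs):
    """Thomas algorithm, one system at a time.

    rhs[i][j]: i is the first array index (carries the recurrence),
    j selects one of the independent systems. The dependent loop over i
    is innermost, which corresponds to the unit-stride ordering.
    """
    n, m = len(rhs), len(rhs[0])
    x = [[0.0] * m for _ in range(n)]
    for j in range(m):
        cp = [0.0] * n
        dp = [0.0] * n
        cp[0] = sup / diags[j]
        dp[0] = rhs[0][j] / diags[j]
        for i in range(1, n):                 # forward elimination
            denom = diags[j] - sub * cp[i - 1]
            cp[i] = sup / denom
            dp[i] = (rhs[i][j] - sub * dp[i - 1]) / denom
        x[n - 1][j] = dp[n - 1]
        for i in range(n - 2, -1, -1):        # back substitution
            x[i][j] = dp[i] - cp[i] * x[i + 1][j]
    return x


def solve_systems_innermost(sub, diags, sup, rhs):
    """Same arithmetic, but the independent index j is innermost.

    Each pass over j has no dependence between iterations, so it is the
    loop a vectorizing compiler can pipeline (at non-unit stride in the
    original array layout).
    """
    n, m = len(rhs), len(rhs[0])
    cp = [[0.0] * m for _ in range(n)]
    dp = [[0.0] * m for _ in range(n)]
    x = [[0.0] * m for _ in range(n)]
    for j in range(m):
        cp[0][j] = sup / diags[j]
        dp[0][j] = rhs[0][j] / diags[j]
    for i in range(1, n):                     # recurrence is the outer loop
        for j in range(m):                    # independent systems
            denom = diags[j] - sub * cp[i - 1][j]
            cp[i][j] = sup / denom
            dp[i][j] = (rhs[i][j] - sub * dp[i - 1][j]) / denom
    for j in range(m):
        x[n - 1][j] = dp[n - 1][j]
    for i in range(n - 2, -1, -1):
        for j in range(m):
            x[i][j] = dp[i][j] - cp[i][j] * x[i + 1][j]
    return x
```

Both routines perform the same operations on each system and differ only in traversal order, which is exactly the choice described above.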

The QG_GYRE benchmark code has been in existence for several years, and has undergone several revisions. The results here are from all three versions of the code (QGBOX, QGBENCH, and QG_GYRE), but they have been edited to remove cases where extensive "tweaking" was performed or where the run size was too small relative to the machine's cache.

Correlations

The correlations examined here are between the QG_GYRE performance in MFLOPS (based on operation counts from the Cray Hardware Performance Monitor) and:
  1. STREAM Triad MFLOPS
  2. SPECfp92
  3. Peak MFLOPS
The STREAM Triad MFLOPS represent the sustainable memory bandwidth for uncached operands. The SPECfp92 measure is not in units of MFLOPS, but its numerical value is typically close to the "Peak MFLOPS" value, so it is convenient to use it without further scaling. The Peak MFLOPS is the manufacturer's "guaranteed not to exceed" specification.
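For readers unfamiliar with the Triad kernel, the sketch below shows the operation and how a timing converts to MFLOPS and to memory traffic. It assumes the usual STREAM counting of 2 floating-point operations and 3 words moved per iteration, and 8-byte words; the function names are made up for this example. Timing a Python loop would measure the interpreter rather than the memory system, so only the kernel and the unit conversions are shown.

```python
def triad(a, b, c, q):
    """STREAM Triad: a[i] = b[i] + q * c[i] (one multiply, one add)."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_mflops(n, seconds):
    """MFLOPS for one pass over n elements: 2 flops per iteration."""
    return 2.0 * n / seconds / 1.0e6


def triad_mbytes_per_s(n, seconds, word_bytes=8):
    """Memory traffic for one pass: 2 loads + 1 store per iteration.

    word_bytes=8 is an assumption (64-bit operands).
    """
    return 3.0 * word_bytes * n / seconds / 1.0e6
```

Under these assumptions each Triad MFLOPS corresponds to 12 MB/s of sustained memory traffic, which is why the Triad figure can stand in for memory bandwidth.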

IMAGE

Figure 1. Ratio of several performance indicators to the performance of the QG_GYRE code. In each computer family, the fastest machines are listed to the left.

The machines are

From these results, it is clear that the ratio of QG_GYRE to STREAM Triad provides the least variability, both within and across processor families.

It is, of course, an abuse of the SPECfp92 metric to try to correlate this aggregate measure with performance on one particular application code. Some preliminary work to compare the QG_GYRE performance with the individual SPECfp92 tests has shown no good correlations across processor families (though good correlations within several processor families were found).

The recommended procedure of determining which of the individual SPECfp92 tests correlates best with one's code is a non-trivial matter, for several reasons:

Therefore, I have made use of the SPECfp92 measure that is most widely available -- the "total" SPECfp92 value (being the geometric mean of the individual SPECfp92 ratios).

Super-Streaming Performance

We will assert that the performance of the "Stream Triad" benchmark is fundamentally related to the performance of large vector application codes. For any particular application code, the ratio of the obtained performance to the "Stream Triad" performance
	Application Performance
	------------------------
	Stream Triad Performance
gives information about such factors as data re-use or extra data movement. Ratios greater than one will generally indicate a degree of data re-use, while ratios less than one indicate that the application's performance is limited by CPU or other resources.

For the machines that we have been looking at, the ratios are:

Cray C90                  0.50
     Y/MP                 0.58
     J916                 0.86
     EL/98                1.09

HP 9000/755               2.21
       /730               2.63
       /720               2.80

IBM RS/6000-990           1.31
           -580           1.55
           -550           1.31
           -320           1.72
           -250           2.00

SGI Power Challenge       2.83
    Challenge             2.83
    Indigo/R4000          2.36

DEC 4000/710              2.74
    3000/500              2.46

The ratio appears to have some correlation with cache size, with the SGI Power Challenge, the SGI Challenge, and the DEC 4000/710 having multi-MB cache sizes and large ratios. The Cray EL/98 has an increased ratio because it has more floating-point pipelines per memory port than the other Cray machines.

Within each processor family, the ratio is almost constant (as noted above), and the constant is only weakly dependent on the family, with values very close to 2.5 for DEC, HP, and SGI, 1.5 for IBM, and 0.5-0.9 for the Cray machines (apart from the EL/98, at 1.09). Because the high ratios are in part due to data re-use, they are a function of problem size, as will be investigated in the next section.
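The family values quoted above can be checked against the listed ratios. The short sketch below simply averages the tabulated numbers per family and reports the range; the dictionary layout and function name are choices made for this example.

```python
from statistics import mean

# QG_GYRE / STREAM Triad ratios, copied from the table above.
RATIOS = {
    "Cray": {"C90": 0.50, "Y/MP": 0.58, "J916": 0.86, "EL/98": 1.09},
    "HP": {"9000/755": 2.21, "9000/730": 2.63, "9000/720": 2.80},
    "IBM": {"RS/6000-990": 1.31, "RS/6000-580": 1.55, "RS/6000-550": 1.31,
            "RS/6000-320": 1.72, "RS/6000-250": 2.00},
    "SGI": {"Power Challenge": 2.83, "Challenge": 2.83, "Indigo/R4000": 2.36},
    "DEC": {"4000/710": 2.74, "3000/500": 2.46},
}


def family_summary(ratios):
    """Return {family: (mean, min, max)} of the per-machine ratios."""
    summary = {}
    for family, machines in ratios.items():
        values = list(machines.values())
        summary[family] = (mean(values), min(values), max(values))
    return summary


for family, (avg, lo, hi) in family_summary(RATIOS).items():
    print(f"{family:5s} mean {avg:4.2f}  range {lo:4.2f}-{hi:4.2f}")
```

The means come out near 2.5-2.7 for HP, SGI, and DEC, near 1.6 for IBM, and near 0.76 for Cray (with the EL/98 included).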

Effect of Problem Size

As in many computational fluid dynamics applications, ocean models can always use more resolution, so the model problem is easily scalable. The QG_GYRE code was run using grid sizes of 32x32, 64x64, 128x128, 256x256, 512x512, 1024x1024, and 2048x2048 (in some cases).
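To relate these grid sizes to cache capacity, it helps to know how much memory one two-dimensional field occupies. The sketch below assumes 8-byte (64-bit) words; the number of such arrays the model keeps live is not stated here, so `n_arrays` is left as a hypothetical parameter.

```python
GRID_SIZES = [32, 64, 128, 256, 512, 1024, 2048]


def array_bytes(n, word_bytes=8):
    """Bytes occupied by one n-by-n field (word_bytes=8 is an assumption)."""
    return n * n * word_bytes


def working_set_bytes(n, n_arrays, word_bytes=8):
    """Total bytes for n_arrays fields; n_arrays is a hypothetical input."""
    return n_arrays * array_bytes(n, word_bytes)


for n in GRID_SIZES:
    print(f"{n:5d} x {n:<5d} {array_bytes(n) / 1024:8.0f} KiB per array")
```

Under the 8-byte assumption a single field grows from 8 KiB at 32x32 to 32 MiB at 2048x2048, so the smallest grids fit in a multi-MB cache many times over while the largest do not fit at all.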

We define efficiency as the fraction of Peak MFLOPS obtained when running the QG_GYRE benchmark. Results for the Cray C90, SGI Power Challenge, and IBM RS/6000-990 are shown in Figure 2.


Author's Note: I plan to replace the Cray C90 results with Cray J90 results in the final version, since this would make the three machines listed comparable in price/cpu. Based on my preliminary results, the J90 and C90 curves will be quantitatively very similar.


IMAGE

Figure 2. Fraction of Peak performance obtained on the QG_GYRE benchmark as a function of problem size on three computers.

Comments: