QG_GYRE and Memory Bandwidth

The QG_GYRE code is a benchmark version of a two-dimensional ocean model. The model solves a second-order finite-difference approximation to the vorticity equation that governs the vertically averaged, large-scale oceanic flow. From the computational point of view, the calculations are strongly dominated by unit-stride vector operations.

One major bottleneck in the code is the solution of a linear, constant-coefficient elliptic equation to recover the streamfunction from the new vorticity field at each time step. This equation is solved by Fourier transformation along the second index of the array (vectorizing over the first index), followed by solution of the resulting independent tridiagonal matrices. This second step has a data dependency on the first array index. On Cray machines, the compiler automatically vectorizes on the second index (using non-unit strides). On hierarchical-memory machines, it is generally more efficient to keep the data dependence in the inner loop and use unit strides, though for small array sizes (or very deep pipelines) it is preferable to place the second index innermost and suffer the resulting increase in cache misses.
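The dependence structure of the tridiagonal step can be illustrated with a minimal sketch. This is not the benchmark's own code: the function name, the unit off-diagonal coefficients, and the per-system diagonal diag[j] are assumptions chosen for illustration, and Python lists do not model the memory strides discussed above. The sketch only shows that the recurrence runs along the first index i, while the systems along the second index j are independent of one another.

```python
def solve_tridiagonal_systems(diag, rhs):
    """Solve x[i-1][j] + diag[j]*x[i][j] + x[i+1][j] = rhs[i][j] for every j.

    Hypothetical layout: i (first index) runs along each system, and
    j (second index) selects one of the independent systems.
    Uses the Thomas algorithm without pivoting, so |diag[j]| >= 2 is assumed.
    """
    n = len(rhs)        # unknowns per system
    m = len(diag)       # number of independent systems
    c = [[0.0] * m for _ in range(n)]   # modified super-diagonal
    x = [row[:] for row in rhs]         # work on a copy of the right-hand side

    for j in range(m):
        c[0][j] = 1.0 / diag[j]
        x[0][j] = x[0][j] / diag[j]

    # Forward elimination: row i needs row i-1 (dependence on the first index).
    for i in range(1, n):
        for j in range(m):              # iterations over j are independent
            denom = diag[j] - c[i - 1][j]
            c[i][j] = 1.0 / denom
            x[i][j] = (x[i][j] - x[i - 1][j]) / denom

    # Back substitution: row i needs row i+1.
    for i in range(n - 2, -1, -1):
        for j in range(m):
            x[i][j] -= c[i][j] * x[i + 1][j]
    return x
```

With j as the inner loop, every inner iteration is independent, which is the ordering a vectorizing compiler prefers. Swapping the two loops places the recurrence in the inner loop, which corresponds to the unit-stride ordering described above for hierarchical-memory machines.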

The QG_GYRE benchmark code has been in existence for several years and has undergone several revisions. The results here are from all three versions of the code (QGBOX, QGBENCH, and QG_GYRE), but they have been edited to remove cases where extensive "tweaking" was performed or where the run size was too small relative to the machine's cache.

Correlations

The correlations examined here are between QG_GYRE performance in MFLOPS (based on operation counts from the Cray Hardware Performance Monitor) and:

STREAM Triad MFLOPS
SPECfp92
Peak MFLOPS

The STREAM Triad MFLOPS represent the sustainable memory bandwidth for uncached operands. The SPECfp92 measure is not in units of MFLOPS, but its magnitude is very close to that of the "Peak MFLOPS", so it is convenient to use it without further scaling. The Peak MFLOPS is the "guaranteed not to exceed" specification from the manufacturer.
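To make the link between Triad MFLOPS and memory bandwidth concrete, here is a minimal sketch of the Triad kernel and the conversion between the two units. It assumes 64-bit operands and the usual STREAM accounting of three words moved and two floating-point operations per iteration; the function names are hypothetical, and the pure-Python loop is for illustration only, not for timing.

```python
BYTES_PER_WORD = 8          # assumption: 64-bit floating-point operands
WORDS_PER_ITERATION = 3     # load b[i], load c[i], store a[i]
FLOPS_PER_ITERATION = 2     # one multiply and one add


def triad(a, b, c, q):
    """STREAM Triad kernel: a[i] = b[i] + q*c[i], updating a in place."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_mflops_from_bandwidth(mbytes_per_second):
    """Convert sustained Triad bandwidth (MB/s) to Triad MFLOPS."""
    bytes_per_flop = BYTES_PER_WORD * WORDS_PER_ITERATION / FLOPS_PER_ITERATION
    return mbytes_per_second / bytes_per_flop
```

Under these assumptions each floating-point operation requires 12 bytes of memory traffic, so a machine that sustains 1200 MB/s on Triad delivers 100 Triad MFLOPS.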

Figure 1. Ratio of several performance indicators to the performance of the QG_GYRE code. In each computer family, the fastest machines are listed to the left.

The machines are:

Cray: C90, Y/MP, J916, EL/98
HP: 9000/755, 9000/730, 9000/720
IBM RS/6000: 990, 580, 550, 320, 250
SGI: Power Challenge, Challenge (150 MHz), Indigo R4000 (100 MHz)
DEC: 4000/710, 3000/500

From these results, it is clear that the ratio of QG_GYRE to STREAM Triad provides the least variability, both within and across processor families.

It is, of course, an abuse of the SPECfp92 metric to try to correlate this aggregate measure with performance on one particular application code. Some preliminary work to compare the QG_GYRE performance with the individual SPECfp92 tests has shown no good correlations across processor families (though good correlations within several processor families were found).

The recommended procedure of determining which of the individual SPECfp92 tests correlates best with one's code is a non-trivial matter, for several reasons:

the individual SPECfp92 codes are not easily available to non-members (such as this author and his institution),
the individual results are not as easily available as one might hope, especially when results for many different machine models are required,
there is increasing concern that the SPECfp92 benchmarks are too small, giving unrealistically inflated solution rates on machines with cache sizes greater than 1 MB.

Therefore, I have made use of the SPECfp92 measure that is most widely available -- the "total" SPECfp92 value (being the geometric mean of the individual SPECfp92 ratios).

Super-Streaming Performance

We will assert that the performance of the STREAM Triad benchmark is fundamentally related to the performance of large vector application codes. For any particular application code, the ratio of the obtained performance to the STREAM Triad performance

	Application Performance
	------------------------
	STREAM Triad Performance

gives information about such factors as data re-use or extra data movement. Ratios greater than one will generally indicate a degree of data re-use, while ratios less than one indicate that the application's performance is limited by CPU or other resources.

For the machines that we have been looking at, the ratios are:

Cray C90               0.50
     Y/MP              0.58
     J916              0.86
     EL/98             1.09

HP 9000/755            2.21
       /730            2.63
       /720            2.80

IBM RS/6000-990        1.31
           -580        1.55
           -550        1.31
           -320        1.72
           -250        2.00

SGI Power Challenge    2.83
    Challenge          2.83
    Indigo/R4000       2.36

DEC 4000/710           2.74
    3000/500           2.46

The ratio appears to have some correlation with cache size, with the SGI Power Challenge, the SGI Challenge, and the DEC 4000/710 having multi-MB cache sizes and large ratios. The Cray EL/98 has an increased ratio because it has more floating-point pipelines per memory port than the other Cray machines.

Within each processor family, the ratio is almost constant (as noted above), and the constant is only weakly dependent on the family, with values very close to 2.5 for DEC, HP, and SGI, 1.5 for IBM, and 0.5-0.9 for the Cray machines other than the EL/98. Because the high ratios are in part due to data re-use, they are a function of problem size, as will be investigated in the next section.
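The per-family values quoted above can be checked directly from the tabulated ratios. The following is a small sketch: the ratios are copied from the table, while the dictionary layout and the function name are my own choices for illustration.

```python
from statistics import mean

# QG_GYRE / STREAM Triad ratios, copied from the table above.
RATIOS = {
    "Cray": {"C90": 0.50, "Y/MP": 0.58, "J916": 0.86, "EL/98": 1.09},
    "HP": {"9000/755": 2.21, "9000/730": 2.63, "9000/720": 2.80},
    "IBM": {"990": 1.31, "580": 1.55, "550": 1.31, "320": 1.72, "250": 2.00},
    "SGI": {"Power Challenge": 2.83, "Challenge": 2.83, "Indigo/R4000": 2.36},
    "DEC": {"4000/710": 2.74, "3000/500": 2.46},
}


def family_summary(ratios):
    """Return {family: (min, mean, max)} of the ratios within each family."""
    summary = {}
    for family, machines in ratios.items():
        values = list(machines.values())
        summary[family] = (min(values), mean(values), max(values))
    return summary


if __name__ == "__main__":
    for family, (lo, avg, hi) in family_summary(RATIOS).items():
        print(f"{family:5s} min {lo:.2f}  mean {avg:.2f}  max {hi:.2f}")
```

The family means come out at roughly 0.76 for Cray, 2.55 for HP, 1.58 for IBM, 2.67 for SGI, and 2.60 for DEC.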

Effect of Problem Size

As in many computational fluid dynamics applications, ocean models can always use more resolution, so the model problem is easily scalable. The QG_GYRE code was run using grid sizes of 32x32, 64x64, 128x128, 256x256, 512x512, 1024x1024, and (in some cases) 2048x2048.

We define efficiency as the fraction of Peak MFLOPS obtained when running the QG_GYRE benchmark. Results for the Cray C90, SGI Power Challenge, and IBM RS/6000-990 are shown in Figure 2.

Author's Note: I plan to replace the Cray C90 results with Cray J90 results in the final version, since this would make the three machines listed comparable in price/cpu. Based on my preliminary results, the J90 and C90 curves will be quantitatively very similar.

Figure 2. Fraction of Peak performance obtained on the QG_GYRE benchmark as a function of problem size on three computers.

Comments:

The increase in efficiency with problem size on the Cray C90 is typical of vector applications on vector architectures. For grid sizes of over 128, the vector pipelines on the C90 (which operate on vectors of length up to 128) are generally full.
The performance decrease on the SGI Power Challenge is clearly a cache-related effect. The data set size of the benchmark is approximately 80*M**2 bytes, where M is the grid size, so the 4 MB cache on the machine can contain the whole working set (modulo associativity conflicts) for grid sizes of less than ~200. The performance degrades more slowly than one might expect because the largest single piece of the computation (the elliptic equation solver) performs about (5+5 log_2(N)) operations per grid point, while using only 32*M**2 bytes of memory. Thus this important part of the calculation is cache-contained for the 256x256 case and has significant data re-use for the 512x512 case.
The performance curve on the IBM RS/6000-990 is nearly flat. The Dcache size is only 256 kB, and so all sizes greater than 64x64 have minimal data re-use. The slight decrease in efficiency for the cases larger than 256x256 is due to extra TLB misses in the larger cases. The TLB on the RS/6000-990 has 512 entries which each map 4 kB pages, so working sets over 2 MB (grid size of ~160) will result in occasional TLB miss penalties that do not occur in the smaller cases.
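The grid-size thresholds quoted in these comments follow from simple working-set arithmetic, sketched below. The bytes-per-grid-point figures (80 for the full benchmark, 32 for the elliptic solver) come from the text; the function names are hypothetical, and kB and MB are assumed to be binary units (1024 and 1024**2 bytes).

```python
from math import isqrt

BYTES_PER_POINT_TOTAL = 80    # whole benchmark: about 80*M**2 bytes
BYTES_PER_POINT_SOLVER = 32   # elliptic solver only: 32*M**2 bytes

KB = 1024                     # assumption: binary kB
MB = 1024 * 1024              # assumption: binary MB


def working_set_bytes(m, bytes_per_point=BYTES_PER_POINT_TOTAL):
    """Approximate working set in bytes for an m x m grid."""
    return bytes_per_point * m * m


def largest_grid_that_fits(capacity_bytes, bytes_per_point=BYTES_PER_POINT_TOTAL):
    """Largest m such that bytes_per_point * m**2 <= capacity_bytes."""
    return isqrt(capacity_bytes // bytes_per_point)


if __name__ == "__main__":
    print("SGI 4 MB cache:       ", largest_grid_that_fits(4 * MB))
    print("IBM 256 kB Dcache:    ", largest_grid_that_fits(256 * KB))
    print("IBM TLB (512 x 4 kB): ", largest_grid_that_fits(512 * 4 * KB))
    print("Solver at 256x256 (MB):", working_set_bytes(256, BYTES_PER_POINT_SOLVER) / MB)
    print("Solver at 512x512 (MB):", working_set_bytes(512, BYTES_PER_POINT_SOLVER) / MB)
```

Under these assumptions the full working set fits in a 4 MB cache up to a grid size of 228, in a 256 kB cache up to 57, and within the 2 MB reach of the TLB up to 161, consistent with the approximate figures of ~200, 64, and ~160 given above. The solver alone needs 2 MB at 256x256 and 8 MB at 512x512.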