re: memory bandwidth

From: Robert Bell (Robert.Bell@mel.dit.csiro.au)
Date: Wed Oct 02 1991 - 17:57:00 CDT


 John,
      I meant, but did not say, that the results can be a factor of two worse
 with different alignments. This happens on an X-MP, but not on the Y-MP, where
 the memory is better.
      It would be nice to add a column of bytes/cycle to your table of results.
 The C-90 theoretical peak is 3*2*8*16 = 768 bytes/cycle, or 1024 bytes/cycle
 if you include the I/O ports as well.
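
 The peak figure above can be checked with a few lines of Python. This is
 only a sketch: the message does not say what each factor in 3*2*8*16 stands
 for, so the reading used here (ports per CPU, words per port per cycle,
 bytes per word, number of CPUs) and the function name are assumptions, as is
 treating the I/O case as one extra port per CPU.

```python
# Peak memory bandwidth in bytes per clock cycle.
# ASSUMED reading of the factors (not spelled out in the message):
#   ports per CPU * words per port per cycle * bytes per word * number of CPUs

def peak_bytes_per_cycle(ports_per_cpu, words_per_port, bytes_per_word, cpus):
    """Return the theoretical peak transfer rate in bytes per cycle."""
    return ports_per_cpu * words_per_port * bytes_per_word * cpus

# C-90 memory ports only: 3*2*8*16
c90_memory_only = peak_bytes_per_cycle(3, 2, 8, 16)

# Counting the I/O ports too, ASSUMED here to act as a fourth port per CPU
c90_with_io = peak_bytes_per_cycle(4, 2, 8, 16)

print(c90_memory_only, c90_with_io)  # 768 1024
```

 Dividing a measured transfer rate by the clock rate in the same way would
 give the bytes/cycle column suggested for the table.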
      The results from the Convexes will be interesting. Since there is only
 one port to memory for each CPU, the rate is 8 bytes/cycle per CPU, or
 32 bytes/cycle for a 4-CPU C2 and 64 bytes/cycle for an 8-CPU C3. Thus the
 SAXPY operation, which needs three vector references (two loads and a store)
 for its two arithmetic operations, can proceed only at one third of the speed
 the CPU is capable of, since the data transfer is the bottleneck. The Linpack
 1000 results are obtained by
 rewriting the code with much unrolling to have only one vector reference
 for each two arithmetic operations.
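
 The one-third figure can be reproduced with a rough model. The sketch below
 assumes, as an illustration rather than a statement about the hardware, that
 the single port moves one vector element per cycle and that the CPU can
 complete one multiply and one add per cycle. SAXPY (y = a*x + y) is taken to
 need three vector references per element: load x, load y, store y. The
 function name and parameters are hypothetical.

```python
from fractions import Fraction

def memory_bound_fraction(vector_refs, arith_ops, ports=1, ops_per_cycle=2):
    """Fraction of peak arithmetic speed reachable per vector element.

    ASSUMPTIONS: each port moves one element per cycle, and the CPU
    completes ops_per_cycle arithmetic operations per cycle at peak.
    """
    memory_cycles = Fraction(vector_refs, ports)          # cycles spent moving data
    compute_cycles = Fraction(arith_ops, ops_per_cycle)   # cycles spent on arithmetic
    return min(Fraction(1), compute_cycles / memory_cycles)

# Plain SAXPY: 3 vector references for 2 arithmetic operations
saxpy = memory_bound_fraction(vector_refs=3, arith_ops=2)

# Unrolled code: 1 vector reference for each 2 arithmetic operations
unrolled = memory_bound_fraction(vector_refs=1, arith_ops=2)

print(saxpy, unrolled)  # 1/3 1
```

 Under these assumptions the unrolled form is no longer limited by the port,
 which is consistent with the point of the rewrite.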
      I now have good autotasking results from our Y-MP2/216.
      Alignment and stride tests are fun! You can make the Y-MP slow down by
 a factor of 15 (12 on later models) by striding through all the vectors with
 a stride of k*(number of banks). On the Convexes (and I suspect the IBM 3090
 VF), with large strides you hit cache problems in a big way.
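
 The bank-stride effect can be illustrated with a toy model of interleaved
 memory. This is a sketch under stated assumptions, not a model of the Y-MP:
 word address a is assumed to live in bank a % num_banks, at most one
 reference is issued per cycle, and a bank is unavailable for bank_busy
 cycles after each access. The bank count and busy time used below are
 illustrative values only, and the function names are hypothetical.

```python
from math import gcd

def banks_touched(stride, num_banks):
    """Distinct banks visited by a constant-stride sweep,
    ASSUMING word address a maps to bank a % num_banks."""
    return num_banks // gcd(stride, num_banks)

def simulate_cycles(stride, num_banks, bank_busy, n):
    """Cycles needed to issue n references, at most one per cycle,
    stalling whenever the target bank is still busy."""
    free_at = [0] * num_banks      # cycle at which each bank is free again
    clock = 0
    for i in range(n):
        bank = (i * stride) % num_banks
        clock = max(clock, free_at[bank])   # wait for the bank if needed
        free_at[bank] = clock + bank_busy
        clock += 1                          # one issue slot per reference
    return clock

NUM_BANKS = 64   # illustrative, not a Y-MP figure
BANK_BUSY = 5    # illustrative, not a Y-MP figure

unit = simulate_cycles(1, NUM_BANKS, BANK_BUSY, 1024)
bad = simulate_cycles(NUM_BANKS, NUM_BANKS, BANK_BUSY, 1024)
print(banks_touched(1, NUM_BANKS), banks_touched(NUM_BANKS, NUM_BANKS))
print(unit, bad)
```

 With a stride equal to a multiple of the bank count every reference lands
 in the same bank, so in this model the slowdown approaches the bank busy
 time.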
      Cheers,
             Rob.
 P.S. Another useful figure would be peak memory transfer rate / peak megaflops,
 as a measure of the memory performance relative to what most people look at.


