(no subject)

From: Robert.Bell@mel.dit.csiro.au
Date: Tue Oct 01 1991 - 00:12:08 CDT

Next message: Daniel.Stodolsky@cs.cmu.edu: "Re: Attainable memory bandwidth"
Previous message: Robert.Bell@mel.dit.csiro.au: "stream_s.f"
Next in thread: Save the Bay: shoot a developer: "(no subject)"
Maybe reply: Save the Bay: shoot a developer: "(no subject)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

John,
I tried your benchmark on a Cray Y-MP2/216, using just a single processor.
The cycle time is 6.0 ns (near enough).
I needed to make several modifications. I first deleted your second
function and used Cray's supplied intrinsic instead. I then ran the program
and obtained sensational but erroneous results. Because your program never
uses the array c, although it defines it several times, the compiler seemed to
optimize the do loops out completely - not a bad feature. I inserted some
code to force the arrays to be stored, turned on a compiler option
to convert double precision to real (64-bit of course on the Cray),
and eventually obtained the following results.

Timing calibration ; t = 0.6722976 clicks

Assignment: Rate = 2379.898425935 MB/s MFLOPS = 0.
Scaling: Rate = 2335.76778731 MB/s MFLOPS = 145.9854867069
Summing: Rate = 3220.624881743 MB/s MFLOPS = 134.1927034059
SAXPYing: Rate = 3264.136362561 MB/s MFLOPS = 272.0113635467
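
The MFLOPS column appears to follow directly from the Rate column. The sketch below assumes 8-byte words and per-iteration counts of (words moved, flops) that are not stated in the message: 2 and 0 for Assignment, 2 and 1 for Scaling, 3 and 1 for Summing, 3 and 2 for SAXPYing. With those assumptions the reported figures are reproduced. The function name is hypothetical.

```python
# Assumed per-iteration costs (not stated in the message):
# 8-byte words, and (words moved, floating-point operations) per kernel.
BYTES_PER_WORD = 8
KERNELS = {
    "Assignment": (2, 0),
    "Scaling": (2, 1),
    "Summing": (3, 1),
    "SAXPYing": (3, 2),
}

def mflops_from_rate(kernel, rate_mb_per_s):
    """Convert a transfer rate in MB/s to MFLOPS for the given kernel."""
    words, flops = KERNELS[kernel]
    # MB/s divided by bytes per iteration gives millions of iterations per second.
    mega_iterations_per_s = rate_mb_per_s / (words * BYTES_PER_WORD)
    return mega_iterations_per_s * flops

# Rates as reported in the results above.
reported_rates = {
    "Assignment": 2379.898425935,
    "Scaling": 2335.76778731,
    "Summing": 3220.624881743,
    "SAXPYing": 3264.136362561,
}
for name, rate in reported_rates.items():
    print(name, mflops_from_rate(name, rate))
```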

The theoretical peaks are 2666 Mbyte/s for the first two tests and
4000 Mbyte/s for the last two, i.e. (2*8/6.0e-9)/1.0e6 and (3*8/6.0e-9)/1.0e6:
two or three 8-byte words moved per 6.0 ns cycle.
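
The peak figures can be checked with a few lines of arithmetic, and comparing them with the measured rates gives the fraction of peak achieved. This is only a sketch: the function name is hypothetical, and the words-per-cycle counts (2 for the first two tests, 3 for the last two) are read off the formula in the message.

```python
def peak_mbytes_per_s(words_per_cycle, cycle_time_s=6.0e-9, bytes_per_word=8):
    """Peak transfer rate in Mbyte/s for a given number of words moved per cycle."""
    return words_per_cycle * bytes_per_word / cycle_time_s / 1.0e6

# (measured rate in MB/s, words moved per cycle assumed for the peak)
measured = {
    "Assignment": (2379.898425935, 2),
    "Scaling": (2335.76778731, 2),
    "Summing": (3220.624881743, 3),
    "SAXPYing": (3264.136362561, 3),
}
for name, (rate, words) in measured.items():
    peak = peak_mbytes_per_s(words)
    print(f"{name}: {rate:.0f} of {peak:.0f} Mbyte/s ({rate / peak:.1%} of peak)")
```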

    I have been interested in memory transfer times for several years now.
I have run tests similar to yours, but also looking at the relative memory
locations of the arrays, and at scatter-gathers with various strides. On some
machines, the relative locations of the arrays (whether they start in the same
bank or not, etc.) can have a substantial effect.
    Incidentally, I have some results from the Cray C-90 (=Y-MP16). The best
result for a loop like your assignment loop is 6937 Mbyte/s.

    Here are some speeds for various machines, expressed in Mbyte/s. The first
group is for a(i) = b(i+k-1), for k from 1 to 128, with a and b forced to start
in the same bank. The two figures for each machine are the slowest and
fastest speeds seen as k varied.
    The second group is for loops like a(i) = b(1 + k*(i-1)) for various values
of k.
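
The two loop forms described above differ only in which elements of b are read. The sketch below spells out the index patterns in Python rather than the benchmark's own language. It shows which elements are touched, not how the loops were timed, and the function names are hypothetical.

```python
def offset_copy(b, n, k):
    """a(i) = b(i+k-1) for i = 1..n (1-based): a contiguous copy starting at offset k."""
    # Subtract 1 to convert each 1-based Fortran index to a 0-based Python index.
    return [b[(i + k - 1) - 1] for i in range(1, n + 1)]

def strided_gather(b, n, k):
    """a(i) = b(1 + k*(i-1)) for i = 1..n (1-based): every k-th element of b."""
    return [b[(1 + k * (i - 1)) - 1] for i in range(1, n + 1)]

b = list(range(100, 120))
print(offset_copy(b, 4, 3))     # reads b(3), b(4), b(5), b(6)
print(strided_gather(b, 4, 3))  # reads b(1), b(4), b(7), b(10)
```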
    I will be happy to provide more information. I would eventually like to
publish the improved version of my benchmark code, which I am working on.

a(i) = b(i+k-1) - offset (Mbyte/s)

Machine      Slowest   Fastest
C-90           6511      6937
Y-MP           2340      2423
X-MP            891      1686
205            1447      1587
C220            137       150
ETA 10P*       1400      1506
ETA 10P         619      1305
SG 280           17.7      19.3

scatter-gather - stride (Mbyte/s)

Machine      Slowest   Fastest
C-90            533      6936
Y-MP            511      2412
X-MP            232      1506
205             185       699
C220             31       162
ETA 10P*         73       565
ETA 10P          25       474
SG 280            5.2      16.4

Robert Bell.


This archive was generated by hypermail 2b29 : Tue Apr 18 2000 - 05:23:01 CDT