The Cray T3D results are in good agreement with a linear combination of the reported write speed of 462 MB/s and the read speed of 320 MB/s (Charles Grassl, personal communication 1995). For ``copy'' we expect the average of 391 MB/s and we observe 384 MB/s. The decrease to 329 MB/s for ``scale'' is due to the part of the cost of the multiplication that is not overlapped with the data transfers. At 384 MB/s, the time to transfer two 8-byte words is 41.7 ns, while at 329 MB/s, the time is 48.6 ns. The difference of 7.0 ns shows that this extra cost is almost exactly one clock period (6.7 ns at 150 MHz).
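The arithmetic behind the ``copy'' and ``scale'' comparison can be checked with a short script. This is only a sketch: the helper name `transfer_time_ns` and the variable names are illustrative assumptions, and 1 MB is taken to mean 10^6 bytes.

```python
# Sketch of the copy/scale arithmetic. All names are hypothetical;
# the numbers are the ones quoted in the text.

def transfer_time_ns(num_bytes, bandwidth_mb_per_s):
    """Time in ns to move num_bytes at the given rate (1 MB = 10**6 bytes)."""
    return num_bytes / bandwidth_mb_per_s * 1e3

WRITE_MB_S = 462.0   # reported write speed
READ_MB_S = 320.0    # reported read speed (read-ahead mode)
CLOCK_MHZ = 150.0    # processor clock

# Simple average of the write and read rates: the expected ``copy'' rate.
expected_copy_mb_s = (WRITE_MB_S + READ_MB_S) / 2.0

# Each iteration of copy or scale moves two 8-byte words (one read, one write).
copy_ns = transfer_time_ns(16, 384.0)    # observed copy rate
scale_ns = transfer_time_ns(16, 329.0)   # observed scale rate

extra_ns = scale_ns - copy_ns            # un-overlapped cost of the multiply
clock_ns = 1e3 / CLOCK_MHZ               # one clock period in ns

print(f"expected copy rate: {expected_copy_mb_s:.0f} MB/s")
print(f"copy: {copy_ns:.1f} ns, scale: {scale_ns:.1f} ns")
print(f"extra cost: {extra_ns:.1f} ns = {extra_ns / clock_ns:.2f} clock periods")
```

The ratio printed on the last line is close to 1, which is the sense in which the extra cost of ``scale'' is about one clock period.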
There is also a slight but noticeable difference between the performance of the Fortran code for the three-operand tests and assembly language versions of those tests. At the time these results were obtained (January 1995), the compiler could not identify the opportunity to use read-ahead mode (which raises the read bandwidth from 228 MB/s to 320 MB/s) when multiple loads occurred in the loop, and it did not schedule the code to overlap operations optimally.
For the ``sum'' operation, the cost is 17.3 ns per iteration to write the output (8 bytes at 462 MB/s), plus 70.2 ns per iteration to read the input (16 bytes at 228 MB/s), plus 6.7 ns per iteration for the un-overlapped portion of the addition. The total of 94.2 ns per iteration compares to 107 ns for the assembly language version and 128 ns for the Fortran version. Thus we see a loss of two cycles per iteration (relative to this simple model) for the assembly language version and about five cycles per iteration for the Fortran version. ``Triad'' is observed to take 136 ns per iteration for the Fortran version and 109 ns for the assembly language version. Here the Fortran version requires an extra cycle for the multiply: it is 8 ns slower than the Fortran ``sum'', whereas the assembly language version is only 2 ns slower than the assembly language ``sum''.
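The simple cost model for ``sum'' can be written out as a short script. This is a minimal sketch: the function names `transfer_time_ns` and `lost_cycles` are illustrative assumptions, and 1 MB is taken to mean 10^6 bytes.

```python
# Sketch of the per-iteration cost model for ``sum''. All names are
# hypothetical; the rates and timings are the ones quoted in the text.

CLOCK_NS = 1e3 / 150.0   # one clock period at 150 MHz, in ns

def transfer_time_ns(num_bytes, bandwidth_mb_per_s):
    """Time in ns to move num_bytes at the given rate (1 MB = 10**6 bytes)."""
    return num_bytes / bandwidth_mb_per_s * 1e3

write_ns = transfer_time_ns(8, 462.0)    # one 8-byte store per iteration
read_ns = transfer_time_ns(16, 228.0)    # two 8-byte loads, no read-ahead
add_ns = CLOCK_NS                        # un-overlapped part of the addition

model_ns = write_ns + read_ns + add_ns   # about 94.2 ns per iteration

def lost_cycles(observed_ns):
    """Clock periods by which an observed time exceeds the model."""
    return (observed_ns - model_ns) / CLOCK_NS

print(f"model: {model_ns:.1f} ns per iteration")
print(f"assembly (107 ns): {lost_cycles(107.0):.1f} cycles lost")
print(f"Fortran  (128 ns): {lost_cycles(128.0):.1f} cycles lost")
```

Rounding the two results to whole cycles gives the losses of two and five cycles per iteration quoted above.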