The intent of STREAM is not to suggest that ``real'' applications have no data re-use, but rather to decouple the measurement of the memory subsystem from the hypothetical ``peak'' performance of the machine. In this respect the test is quite complementary to the LINPACK benchmark test, which is typically optimized to the point that a very large fraction of full speed is obtained on modern machines, independent of the performance of their memory systems.
Each of the four tests adds independent information to the results: ``Copy'' measures transfer rates in the absence of arithmetic; ``Scale'' adds a simple arithmetic operation; ``Add'' adds a third operand, allowing multiple load/store ports to be tested; and ``Triad'' allows chained/overlapped/fused multiply/add operations.
STREAM dates back to a time when floating-point arithmetic was comparable in cost to memory accesses, so that the copy test was significantly faster than the others. This is no longer the case on any machines of interest to high performance computing, and the four STREAM bandwidth values are typically quite close to each other.
All results presented here employ 64-bit values. Most of the results are for the standard test case of two-million-element vectors, with no specified array offsets. A few of the results are ``optimized'' by running the code with many different array offsets and choosing the best results. This is acceptable because the intent of STREAM is to measure the best bandwidth available to the user using standard Fortran, not to get lost in the exploration of the myriad ways in which memory systems can deliver suboptimal performance.
It should be noted that in the memory bandwidth estimates, the STREAM benchmark gives ``credit'' for both memory reads and memory writes (in contrast to the standard usage for bcopy()), but only gives credit for the memory references explicitly specified.
Specifically, on machines with a write-allocate cache policy, each of these operations will result in one additional 8-byte read per iteration in order to load the cache line containing the output vector. This data traffic is superfluous to the specified calculation, and it would be inappropriate to credit a system for generating such extraneous memory traffic. (It would be similarly inappropriate to give credit for moving entire cache lines for non-unit-stride applications that might use only one word per cache line in the actual calculation, though such applications are not considered here.)
Also note that the Triad operation is not identical to the level 1 BLAS SAXPY operation on machines with a write-allocate cache policy, since the Triad operation requires an extra memory read to load the elements of the ``a'' vector into cache before they are overwritten. The SAXPY kernel overwrites one of its input vectors, thus eliminating the write-allocate traffic.