This report has presented sustainable memory bandwidth measurements for a large variety of systems obtained over the last five years.
Analysis of the data reveals strong systematic variations in memory bandwidth and machine balance according to the type of memory system. In particular, hierarchical-memory, shared-memory systems are generally strongly imbalanced with respect to memory bandwidth, typically able to sustain only 3-10% of the memory bandwidth needed to keep the floating-point pipelines busy. In contrast, vector shared-memory machines have very low machine balance parameters and are typically capable of performing approximately one load or store per floating-point operation.
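The balance comparison above can be made concrete with a small sketch. The numbers below are purely illustrative (they are not taken from the report's tables), chosen only to fall within the stated 3-10% range for a cache-based system and near one word per FLOP for a vector system:

```python
def machine_balance(peak_mflops, sustained_mb_per_s, bytes_per_word=8):
    """Sustained words/s delivered per peak FLOP/s.

    A value near 1.0 means the memory system can feed roughly one
    operand per floating-point operation; values of 0.03-0.10 are
    typical of the hierarchical-memory systems discussed above.
    """
    words_per_s = sustained_mb_per_s * 1e6 / bytes_per_word
    return words_per_s / (peak_mflops * 1e6)

# Hypothetical cache-based system: 200 MFLOPS peak, 80 MB/s sustained
cache_based = machine_balance(peak_mflops=200, sustained_mb_per_s=80)   # 0.05

# Hypothetical vector system: 1000 MFLOPS peak, 8000 MB/s sustained
vector = machine_balance(peak_mflops=1000, sustained_mb_per_s=8000)     # 1.0
```

With these assumed figures the cache-based machine sustains only 5% of the bandwidth needed to keep its pipelines busy, while the vector machine sustains one word per floating-point operation.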
The recent trends in machine balance observed in this data strongly suggest that an emphasis be placed on increasing sustainable memory bandwidth. In the absence of unexpectedly large changes in the economics of memory technology (as a function of speed), latency will grow increasingly dominant in the memory bandwidth equation. This implies a need for fundamental architectural changes that enable systems to use all available information about data access patterns in order to effectively apply latency tolerance mechanisms (e.g., pre-fetch, block fetch, fetch with stride, cache bypass). These features are perhaps best understood under the umbrella of ``data vectorization''. At the same time, these systems should not preclude the use of ``dumb'' caches in the memory hierarchy when the memory access patterns are not visible to the compiler -- an approach which has clearly been successful across a broad range of applications.
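The role of latency in the bandwidth equation, and the payoff from latency tolerance, can be sketched with a simplified model. This is not a model from the report; it assumes each cache-line fetch pays the full memory latency unless multiple fetches are kept in flight concurrently (as pre-fetch or block fetch would allow), and all parameter values are hypothetical:

```python
def effective_bw(line_bytes, latency_s, peak_bw_bytes_s, outstanding=1):
    """Sustained bandwidth under a simple latency-plus-transfer model.

    Each line fetch costs (latency / outstanding) of exposed latency
    plus the raw transfer time; `outstanding` models the number of
    concurrent fetches a latency-tolerance mechanism keeps in flight.
    """
    time_per_line = latency_s / outstanding + line_bytes / peak_bw_bytes_s
    return line_bytes / time_per_line

# Hypothetical system: 64-byte lines, 100 ns latency, 1 GB/s peak
naive    = effective_bw(64, 100e-9, 1e9, outstanding=1)
prefetch = effective_bw(64, 100e-9, 1e9, outstanding=8)
```

Under these assumed parameters, a single outstanding fetch sustains well under half the peak bandwidth because latency dominates the per-line cost, while keeping eight fetches in flight amortizes the latency and recovers most of the peak rate -- illustrating why exposing access-pattern information to the hardware matters as latency grows relative to transfer time.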