In contrast to the effective overlapping of latencies and transfers on the vector machines, most of the cached machines in current service have little or no effective overlapping of latencies or transfers with other latencies or transfers. Despite the large body of academic research on improving the performance of cache-based systems (e.g., the review in ), almost all of the systems here (which represent most of the machines sold in the U.S.) have standard single-threaded blocking caches. No prefetching, cache bypass, or other novel techniques are represented here, and of the machines tested, only a few of the newest entries can handle more than one outstanding cache miss or are equipped with multiple load/store units on each cpu.
The complexity of implementing significant overlapping of memory references in the presence of virtual memory, and under the requirement of cache coherence, has apparently been prohibitive until the introduction of the very newest processors. This has perhaps been compounded by the use of the SPEC92 benchmark suite, for which the need for sustainable memory bandwidth can be eliminated by the use of large (4 MB) caches. The SPEC95 benchmark set is far more demanding in memory bandwidth and working-set size, and is likely to encourage the industry to devote more emphasis to improving this characteristic of future systems.
For most of the machines tested, the sustainable memory bandwidth is not easy to model because the required machine parameters are not all available. Given a model, one can typically estimate one missing machine parameter (such as latency), but it is then not possible to use the data to evaluate the accuracy of the model itself.
An attempt was made to fit some of the results with a simple ``latency
plus transfer time'' model. The ``transfer time'' is typically (but
not always) obtainable from the parameters of the system memory bus
and a knowledge of the cache line size. The ``latency'' can then
either be inferred from the results, or determined independently using
the lat_mem_rd program from Larry McVoy's lmbench suite.
A difficulty is that the results are quite sensitive to latency
estimates. Even in those cases where vendors provide some latency
information, there is no uniformly accepted definition of exactly what
the term ``latency'' refers to.
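The ``latency plus transfer time'' model can be written down directly. The following sketch (function and parameter names are illustrative, not taken from any vendor documentation) assumes that consecutive cache misses do not overlap at all:

```python
# Minimal sketch of the "latency plus transfer time" model.  All names
# here are illustrative; the numbers in the example are the RS/6000-320
# figures discussed in the text.

def time_per_cache_line(latency_ns, line_bytes, bus_bytes_per_ns):
    """Time to service one cache miss: a fixed latency plus the time
    to move one cache line across the memory bus."""
    transfer_ns = line_bytes / bus_bytes_per_ns
    return latency_ns + transfer_ns

def bandwidth_model(latency_ns, line_bytes, bus_bytes_per_ns):
    """Sustainable bandwidth implied by the model, assuming no overlap
    between consecutive misses.  Returns 10**6 bytes per second,
    since bytes/ns * 1000 == bytes/us == 10**6 bytes/s."""
    t_ns = time_per_cache_line(latency_ns, line_bytes, bus_bytes_per_ns)
    return line_bytes / t_ns * 1000.0

# 278 ns latency, 64-byte line, 8-byte bus at one transfer per 50 ns
# cycle (0.16 bytes/ns): about 94 * 10**6 bytes/s of total traffic.
print(round(bandwidth_model(278.0, 64, 8 / 50.0)))  # -> 94
```

Because the model has only two free parameters, one observed bandwidth can pin down the latency only if the transfer time is known independently, which is the difficulty noted above.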
The IBM RISC System/6000 model 320 provides an illustrative example
from an older machine (first delivered in 1990).
The lmbench program
lat_mem_rd estimates a cache miss
latency of 660 ns, based on a string of consecutive dependent load instructions
(i.e., chasing a string of pointers). The observed memory
bandwidth for copy operations is 60 MB/s. Since this machine has a
write-allocate cache policy, the copy operation is actually moving 90
MB/s of data. Given the 64 byte cache line size, this corresponds to
1.47 million cache lines per second, or 678 ns per cache line. This
agrees with the latency measured by
lat_mem_rd to within one 50 ns
cycle, and indicates that there is effectively no overlap of latency
and/or transfer between consecutive (cache-missing) memory references.
The memory interface on the machine is 8 bytes wide, so a minimum of 8
clock periods (400 ns) is required to move a cache line across the
bus. This leaves 278 ns as an estimate of the ``latency'' required to
get the first word of the cache line into the cpu. This ``latency''
is about 5.5 clock periods, and is only about 70% of the transfer time.
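The arithmetic above can be checked mechanically. The following back-of-the-envelope sketch reproduces the RS/6000-320 numbers, taking ``MB'' to mean 2^20 bytes, which is the convention the quoted figures imply:

```python
# Back-of-the-envelope check of the RS/6000-320 numbers quoted in the
# text.  Only figures stated above are used; MB means 2**20 bytes.

copy_mb_s   = 60.0                       # observed "copy" bandwidth
traffic_b_s = copy_mb_s * 1.5 * 2**20    # write-allocate: 3 bus transfers
                                         # per 2 bytes copied -> factor 1.5
line_bytes  = 64

ns_per_line = line_bytes / traffic_b_s * 1e9    # time per cache line
transfer_ns = (line_bytes / 8) * 50             # 8-byte bus, 50 ns cycles
latency_ns  = ns_per_line - transfer_ns         # time to first word

print(round(ns_per_line))   # -> 678, within one cycle of lat_mem_rd
print(round(latency_ns))    # -> 278, about 5.5 cycles
```

The 678 ns figure matching the pointer-chasing latency from lat_mem_rd is what justifies the no-overlap conclusion: each miss pays the full latency-plus-transfer cost.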
The other models in the IBM RS/6000 line provide significantly increased bandwidth (up to a factor of 12 greater), with latencies similar to those found on the RS/6000-320. This increase comes from the combination of wider and faster memory interfaces (128 and 256 bits wide, and matched in speed to the cpu) and from the introduction of a second independent integer unit on the Power2 processor. The two integer units operate independently, allowing significant overlapping of latencies on cache misses.
In contrast to the older uniprocessor results, many of the newer shared-memory machines have latencies that are considerably greater than the transfer time. For example, the SGI Power Challenge runs ``copy'' at 135 MB/s. Assuming a write-allocate policy (which is not clear from the vendor documentation, but seems likely), the machine is moving 200 MB/s across the shared bus. The bus moves 128-byte chunks of data (cache line ``sectors'') at a rate of 1200 MB/s, suggesting that the latency is roughly 5 times larger than the transfer time. Similarly, the DEC AlphaServer 8400-5/300 has a ``copy'' rate of 186 MB/s for a single cpu. Again assuming a write-allocate policy, this 280 MB/s data transfer rate is only 17% of the bus bandwidth, suggesting that the effective latency is roughly 5 times larger than the transfer time on this machine as well. This large ratio enhances parallel scalability, but at the cost of not optimizing for single-processor memory bandwidth.
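The same estimate for the Power Challenge can be sketched as follows (write-allocate assumed, as noted above; the ratio is independent of whether MB means 10^6 or 2^20 bytes, since the units cancel):

```python
# Rough check of the latency-to-transfer ratio implied by the SGI
# Power Challenge numbers in the text.  Write-allocate is assumed.

sector_bytes = 128       # cache line "sector" moved per bus transaction
traffic_mb_s = 200.0     # data actually crossing the bus during copy
bus_mb_s     = 1200.0    # peak bus rate for 128-byte sectors

transfer_ns = sector_bytes / bus_mb_s * 1e3      # bus occupancy per sector
total_ns    = sector_bytes / traffic_mb_s * 1e3  # observed time per sector
latency_ns  = total_ns - transfer_ns             # everything else is latency

print(round(latency_ns / transfer_ns))  # -> 5: latency ~5x transfer time
```

In general, when the sustained traffic is a fraction 1/N of the peak bus rate and misses do not overlap, the implied latency-to-transfer ratio is N - 1, which is why the 17% figure for the AlphaServer also yields a ratio of roughly 5.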
Unfortunately, newer machines are characterized by increasingly complicated memory subsystems, and simple models are less and less useful for performance characterization. In such cases, direct measurement of sustainable memory bandwidth provides an important tool for estimating the ``balance'' of a machine, and thus gives some clues as to the likely source of bottlenecks.