Modelling Bandwidth on Vector Machines


Traditional vector supercomputers, exemplified by the Cray Research Parallel Vector Processors (PVPs), are characterized by a (nominally) uniform-access, shared, non-virtual memory. These memories are built from many banks (typically 64 to 1024), so that extremely high throughput can be obtained by accessing the banks in sequence. The Cray machines also have multiple (2 or 4) independent load/store units per vector pipeline. The interface between the CPUs and the shared memory is designed so that all CPUs can concurrently run at least two load/store operations each with minimal degradation in system throughput.

The "peak" memory bandwidth of these machines is simply the number of load/store units in operation, times their width, times the clock rate. These numbers are easily obtained from the vendor literature. Actual performance is bounded above by this "peak" rating, and it falls short of the peak because of several factors, including:

Start-up latency
Bank busy time
Interface throughput limitations
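The "peak" rating defined above is plain arithmetic, and a minimal sketch may make the units explicit. The function name and the machine parameters below are hypothetical illustrations, not figures taken from vendor literature.

```python
def peak_bandwidth(units: int, width_bytes: int, clock_hz: float) -> float:
    """Peak memory bandwidth in bytes per second.

    units       -- number of load/store units operating at once
    width_bytes -- bytes moved per unit per clock cycle
    clock_hz    -- clock rate in cycles per second
    """
    return units * width_bytes * clock_hz


# Hypothetical machine (assumed values): 3 load/store units,
# 8-byte (64-bit) words, 100 MHz clock.
peak = peak_bandwidth(units=3, width_bytes=8, clock_hz=100e6)
print(f"peak = {peak / 1e9:.1f} GB/s")
```

Each of the three factors listed above lowers the sustained rate below this bound.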

Start-up latency is relevant on most vector machines because of the finite vector lengths used (typically 64 or 128 64-bit words). However, when multiple vector instructions are executed "end-to-end", much of the latency can be overlapped with the transfer time of the previous instruction. On the Cray J916, for example, the "effective" latency for a simple vector copy operation is asymptotically only 4 cycles per 64-word vector load/store. Additional latencies accrue when arithmetic operations are included.
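To see what a 4-cycle effective latency costs, the following sketch estimates the fraction of peak that survives it. It assumes that a load/store unit moves one word per cycle while streaming and that the effective latency is pure overhead added to each vector operation. Both are simplifying assumptions made here, and the function name is hypothetical.

```python
def transfer_efficiency(vector_length: int, overhead_cycles: float) -> float:
    """Fraction of peak sustained by one load/store unit.

    Assumes one word per cycle while streaming, plus a fixed
    non-overlapped overhead per vector instruction.
    """
    return vector_length / (vector_length + overhead_cycles)


# Figures quoted in the text for a vector copy on the Cray J916:
# 64-word vectors with an effective latency of 4 cycles.
j916_copy = transfer_efficiency(vector_length=64, overhead_cycles=4)
print(f"{j916_copy:.1%} of peak")
```

Under these assumptions the copy would sustain 64/68 of peak, roughly 94%, before bank conflicts or interface limits are considered.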

Bank busy time is the time after an access before a bank is ready to accept a new request for data. Accessing a single bank repeatedly (for example, with power-of-two strides) can seriously degrade performance. Although individual applications can usually be written to avoid pathological memory reference patterns, execution of programs on multiple CPUs will typically result in randomly occurring bank conflicts. These are workload-dependent, but typically reduce bandwidth by 5-15% on a fully configured Cray PVP machine. Bank busy time is also relevant if the machine is configured with so many CPUs that the number of load/store units exceeds the number of banks divided by the bank busy time (measured in clock cycles). In this case, some degree of memory bandwidth reduction will occur even for perfect access patterns.
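Both bank effects can be illustrated with a short sketch. It assumes simple low-order interleaving, in which word address a lives in bank a mod N, and that each load/store unit issues one request per cycle. Actual machines may map addresses differently, and the function names are hypothetical.

```python
from math import gcd


def distinct_banks(num_banks: int, stride: int) -> int:
    """Banks touched by a constant-stride access stream.

    Assumes bank = word_address mod num_banks (low-order interleaving).
    """
    return num_banks // gcd(num_banks, stride)


def sustainable_fraction(num_banks: int, busy_cycles: int,
                         requests_per_cycle: int) -> float:
    """Upper bound on the fraction of requests the banks can accept.

    Each bank accepts one request every busy_cycles cycles, so the
    banks together accept at most num_banks / busy_cycles per cycle.
    """
    return min(1.0, num_banks / (busy_cycles * requests_per_cycle))


# Illustrative values only: 64 banks with a 4-cycle bank busy time.
for stride in (1, 3, 8, 64):
    banks = distinct_banks(64, stride)
    frac = sustainable_fraction(banks, busy_cycles=4, requests_per_cycle=1)
    print(f"stride {stride:2d}: {banks:2d} banks, at most {frac:.2f} of peak")
```

In this model an odd stride reaches every bank, while a stride equal to the bank count returns to the same bank on every access and is limited by the bank busy time alone. The second function also covers the configuration limit: once the number of requests per cycle exceeds the bank count divided by the busy time, the bound drops below 1 even for a perfect access pattern.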

Finally, the interface itself can be incapable of sustaining full performance when all CPUs are attempting memory transfers on all of their load/store units simultaneously. Typically the Cray Research PVP machines are designed so that this admittedly extreme usage limits throughput only in the largest configurations, if at all. In the results shown here, the effect is apparent only when running 3-port operations simultaneously on all 16 CPUs of a Cray C90.

For the results shown here, the vector machines typically sustain a large fraction of their "peak" bandwidth, primarily because they overlap latencies and transfers so effectively.


John McCalpin
Tue Aug 20 20:43:16 PDT 1996