Stream Memory Controller (SMC)







The SMC implements dynamic access ordering -- the hardware reorders stream accesses at runtime to improve memory system performance. The compiler detects streamable loops and generates code to transmit information about the streams to the hardware at runtime. The SMC then prefetches read operands and buffers write operands, issuing the accesses in an order that takes advantage of both the memory system architecture (i.e., it tries to keep all banks busy in an interleaved system) and the DRAM component characteristics (our initial implementation exploits fast-page mode). The stream buffers look like FIFOs to the processor; each stream is mapped to one FIFO. The CPU simply issues its accesses in the natural order of the computation, referencing the head of the appropriate FIFO to access the next element of a given stream.
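To make the programming model concrete, here is a minimal sketch in C of what compiler-generated code for a daxpy-style loop might look like. The register layout, addresses, and helper names (smc_set_stream, SMC_FIFO_HEAD) are hypothetical illustrations, not the actual SMC interface.

    /* Hypothetical sketch of SMC-style stream access from the CPU's side.
     * Register layout, addresses, and field encodings are placeholders. */
    #include <stdint.h>

    /* Assumed memory-mapped SMC registers (addresses are illustrative). */
    #define SMC_STREAM_DESC(n)  ((volatile uint64_t *)(uintptr_t)(0xF0000000u + 32u * (n)))
    #define SMC_FIFO_HEAD(n)    ((volatile double   *)(uintptr_t)(0xF0010000u + 8u  * (n)))

    /* Describe one stream to the SMC: base address, stride (in elements),
     * length, and direction (read or write). */
    static void smc_set_stream(int fifo, const void *base, long stride,
                               long length, int is_read)
    {
        volatile uint64_t *desc = SMC_STREAM_DESC(fifo);
        desc[0] = (uint64_t)(uintptr_t)base;
        desc[1] = (uint64_t)stride;
        desc[2] = (uint64_t)length;
        desc[3] = (uint64_t)is_read;
    }

    /* daxpy: y[i] = a * x[i] + y[i], with each stream mapped to a FIFO.
     * The CPU issues accesses in the natural order, touching only the
     * FIFO heads; the SMC reorders the actual DRAM accesses. */
    void daxpy_smc(double a, const double *x, double *y, long n)
    {
        smc_set_stream(0, x, 1, n, 1);   /* FIFO 0: read stream x  */
        smc_set_stream(1, y, 1, n, 1);   /* FIFO 1: read stream y  */
        smc_set_stream(2, y, 1, n, 0);   /* FIFO 2: write stream y */

        for (long i = 0; i < n; i++) {
            double xi = *SMC_FIFO_HEAD(0);        /* next element of x */
            double yi = *SMC_FIFO_HEAD(1);        /* next element of y */
            *SMC_FIFO_HEAD(2) = a * xi + yi;      /* enqueue result    */
        }
    }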

The project has involved compiler development, derivation of analytic performance bounds, simulation of the proposed design, design and fabrication of two VLSI implementations of the SMC, a complete board-level computer integration, and performance evaluation of the result. During the past year a full SMC system was implemented and tested. The design was performed using Mentor Graphics Corporation's Design Architect for schematic capture, VHDL for state machine description, and Cascade Design Automation's Epoch for synthesis and silicon compilation. The first version of the chip is a 16-bit, 4-way, bit-sliced system consisting of approximately 71,000 transistors in a 132-pin package. This version has 4 fixed-size FIFOs (16 elements each). The second version of the chip provides FIFOs whose depth can be set to any power of 2 from 8 to 128. Both were fabricated through MOSIS using the 1.2 micron, 3-level metal Hewlett-Packard 34 process. Both versions of the SMC routinely deliver the bandwidth predicted by our analytic and simulation models on the benchmarks studied.



The SMC Team

Current Members

Alumni

  • Assaji Aluwihare
  • Charlie Hitchcock
  • Trevor Landon
  • Sean McGee
  • Steve Moyer
  • Chenxi Wang


Publications

Conference Papers

Other


Prototype Performance

ASIC Version 1

These graphs illustrate the measured performance of our prototype SMC system on several benchmark kernels, for vectors of 16 to 8192 elements. The vectors used in each experiment are of equal length and share no DRAM pages. All vectors are unit stride unless otherwise indicated. Performance is given as the percentage of peak system bandwidth exploited for each benchmark. The green lines labeled "performance limit" indicate the attainable bandwidth for the computation: these limits are due to SMC startup costs, unavoidable page misses, or the cost of moving data between the SMC and CPU chips. The black lines indicate the performance of our access-ordering hardware. The red lines indicate the performance measured when using "normal" caching load instructions to access the stream data in the i860's own cache-optimized memory, and the blue lines indicate the performance measured when using the i860's non-caching pipelined floating-point load (pfld) instruction.






Performing the computation in the natural order with caching accesses yields less than 32% of the system's peak bandwidth for all access patterns. The effective bandwidth delivered by the SMC for these kernels is between 202% and 305% of that delivered by normal caching. Put another way, the SMC brings the average cycles per memory access very close to the minimum of one cycle for long-vector computations, whereas the average number of cycles per access when using the cache is more than three. The disparity between SMC and cache performance would be even more dramatic for non-unit stride computations, since in that case each cache-line fill would fetch unneeded data. For instance, at a stride of five, the SMC could deliver 12 times the effective bandwidth of the cache (for an average of one cycle for each element accessed via the SMC, versus over 12 cycles for each one accessed through the cache).
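As a rough illustration of how the two metrics above relate, the following sketch converts average cycles per stream access into a percentage of peak bandwidth, assuming (as the "minimum of one cycle" figure suggests) that peak corresponds to one access per CPU cycle. The sample values are illustrative, not measured data.

    /* Relate cycles-per-access to percentage of peak bandwidth, assuming
     * peak == one stream access per CPU cycle. */
    #include <stdio.h>

    static double percent_of_peak(double avg_cycles_per_access)
    {
        return 100.0 / avg_cycles_per_access;
    }

    int main(void)
    {
        /* Illustrative values, not measured results: */
        printf("%.1f%%\n", percent_of_peak(1.1));   /* near one cycle, SMC-like        */
        printf("%.1f%%\n", percent_of_peak(3.2));   /* more than three cycles, cache   */
        printf("%.1f%%\n", percent_of_peak(12.5));  /* stride-five through the cache   */
        return 0;
    }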

When non-caching instructions are used (e.g., if the programmer does not want stream data to entirely fill the cache), performance is generally even worse than when using caching loads. The exception is the scale benchmark: this kernel operates on a single vector, so it accesses every element in a single DRAM page before switching to a different page. In contrast, each cache-line fill incurs a DRAM page miss for the first word in the line, regardless of whether it is on the same DRAM page as the previously loaded cache line. The pfld versions of the benchmarks could take more advantage of fast-page mode -- and thus improve performance -- by unrolling the loops and grouping accesses to each stream, so that several accesses hitting the current page are issued after each DRAM page miss.
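A minimal sketch of that restructuring follows, using ordinary C loads as stand-ins for pfld; the unroll factor of four is an arbitrary choice for illustration.

    /* Unroll and group the loads to each stream, so that consecutive
     * non-caching loads tend to hit the same DRAM page.  Plain C loads
     * stand in for the i860's pfld instruction. */
    void daxpy_grouped(double a, const double *x, double *y, long n)
    {
        long i;
        for (i = 0; i + 4 <= n; i += 4) {
            /* Group accesses to x: at most one page miss, then page hits. */
            double x0 = x[i], x1 = x[i + 1], x2 = x[i + 2], x3 = x[i + 3];
            /* Then group accesses to y. */
            double y0 = y[i], y1 = y[i + 1], y2 = y[i + 2], y3 = y[i + 3];

            y[i]     = a * x0 + y0;
            y[i + 1] = a * x1 + y1;
            y[i + 2] = a * x2 + y2;
            y[i + 3] = a * x3 + y3;
        }
        for (; i < n; i++)            /* cleanup when n is not a multiple of 4 */
            y[i] = a * x[i] + y[i];
    }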


ASIC Version 2


The vectors used in the experiments represented in this section were 128 elements long (where each element is 64 bits). Unless otherwise indicated, they have unit stride. In the following graphs, the red lines labeled "limit" indicate the attainable bandwidth for the computation. The blue lines illustrate the results of our functional simulator. The black lines indicate the performance of version 2 of the SMC ASIC. The green lines indicate the maximum measured performance for 128-element vectors when using "normal" caching load instructions to access the stream data in the i860's own cache-optimized memory, and the purple lines indicate the highest performance measured when using the i860's non-caching pipelined floating-point load (pfld) instruction. These latter two values are independent of FIFO depth, but we draw them as lines on these graphs for comparison.







We built the MSU (Memory Scheduling Unit) so that it fills a whole FIFO at a time -- this lets it exploit fast-page mode as much as possible, but it also creates a start-up cost for using the SMC. For computations that read more than one stream, the processor must wait for the first element of stream s while the MSU fills the FIFOs for the first (s-1) read streams. By the time the MSU has provided all operands for the first loop iteration, it will have prefetched data for many future iterations, so the processor will not stall again soon. So even though deeper FIFOs allow the MSU to get more data from a DRAM page each time it is made current, they cause the processor to wait longer at startup. These graphs illustrate the net effect of these competing factors.
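The trade-off can be sketched with a toy model -- this is not the project's actual analytic bound, and the page-hit and page-miss costs below are placeholder parameters: with s read streams and FIFOs of depth d, the processor waits for roughly (s-1) FIFO fills before its first operand from the last read stream arrives.

    /* Toy model of the startup trade-off described above. */
    #include <stdio.h>

    static double startup_wait_cycles(int read_streams, int fifo_depth,
                                      double hit_cost, double miss_cost)
    {
        /* Assume one page miss plus (depth - 1) page hits per FIFO fill. */
        double fill = miss_cost + (fifo_depth - 1) * hit_cost;
        return (read_streams - 1) * fill;
    }

    int main(void)
    {
        /* Placeholder costs in CPU cycles per access. */
        const double hit = 1.0, miss = 5.0;
        for (int d = 8; d <= 128; d *= 2)
            printf("depth %3d: startup wait ~ %.0f cycles (2 read streams)\n",
                   d, startup_wait_cycles(2, d, hit, miss));
        return 0;
    }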

Short-vector computations have fewer total accesses over which to amortize startup and page-miss costs. For these computations, initial delays can represent a significant portion of the computation time; this is easy to see in the performance curves for the short-vector computations that read two or more streams (daxpy, swap, tridiag, and vaxpy). The copy and scale kernels incur no initial delay, since they only read one stream. If we plotted performance for deeper FIFOs, the curves would be flat: bandwidth remains constant once FIFO size exceeds the vector length.

The disparity between SMC and cache performance is even more dramatic for non-unit stride computations, where each cache-line fill fetches unneeded data. The bottom graph in this section shows percentages of peak bandwidth delivered by the SMC versus caching and non-caching accesses for the vaxpy kernel with stride-five vectors. For these vectors, the SMC delivers between 10.4 and 13.2 times the effective bandwidth of the cache. For computations using the i860's pipelined, non-caching load, performance for non-unit strides is approximately half that for unit strides, since quadword instructions can no longer be used to access two elements every two cycles. In contrast, SMC performance is relatively insensitive to changes in vector stride, as long as the stride hits all memory banks and is small relative to the page size. For instance, SMC vaxpy performance for stride five is essentially identical to that for stride one, whereas the cache and pfld performances decrease by factors of 3.7 and 1.8, respectively.
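A back-of-the-envelope check of that pfld slowdown, assuming (as stated above) that with unit stride a quadword load returns two adjacent 64-bit elements every two cycles while a non-unit stride forces one load per element; the measured factor of about 1.8 is somewhat better than this idealized 2x.

    /* Idealized pfld throughput with and without quadword loads. */
    #include <stdio.h>

    int main(void)
    {
        double unit_stride    = 2.0 / 2.0;  /* 2 elements per 2 cycles */
        double nonunit_stride = 1.0 / 2.0;  /* 1 element  per 2 cycles */
        printf("expected slowdown: %.1fx (measured above: about 1.8x)\n",
               unit_stride / nonunit_stride);
        return 0;
    }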





comments to Sally McKee, mckee@virginia.edu