Inexpensive Throughput Enhancement in Small-Scale Embedded Microprocessors with Block Multithreading: Extensions, Characterization, and Tradeoffs

J. W. Haskins, Jr., K. R. Hirst, and K. Skadron.
In Proc. of the 20th IEEE International Performance, Computing, and Communications Conference, April 2001.

Abstract
This paper examines differential multithreading (dMT) as an attractive organization for coping with pipeline stalls in small-scale processors like those used in embedded environments. The paper proposes extensions to block multithreading to cope with data- and instruction-cache misses, and then explores some of the design tradeoffs that this enables. Results show that dMT boosts throughput substantially and can in fact replace dynamic branch prediction or data forwarding, or can be used to reduce the sizes of the instruction and data caches.

Block multithreading, described by Farrens and Pleszkun, is a technique to achieve high throughput from a single-issue microarchitecture by switching among multiple instruction streams in response to pipeline stalls. Although single-issue organizations are no longer used in high-performance processors, they remain common even in newly designed processors for small-scale, embedded devices. Like the original description of block multithreading, dMT uses auxiliary pipeline registers to save the state of in-flight instructions. By coping with data- and instruction-cache misses, however, our implementation can attack all the major sources of pipeline stalls. Overall, we find that dMT can substantially lower the cost and complexity of microprocessors for embedded environments, especially environments for which throughput rather than speed is the primary concern. In addition, dMT is an attractive prospect for use in chip-multiprocessing environments.
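
The switch-on-stall idea described above can be illustrated with a toy cycle count. This is only a sketch under stated assumptions, not the paper's simulator or its dMT design: the function names are hypothetical, each thread is modeled as a list of per-instruction stall lengths in cycles, a thread switch is assumed to cost nothing, and the next ready thread is chosen round-robin.

```python
def run_serial(threads):
    """Cycles needed when threads run one after another on a single-issue
    core and every stall is simply waited out (1 issue cycle + stall)."""
    return sum(1 + stall for thread in threads for stall in thread)


def run_switch_on_stall(threads):
    """Cycles needed when the core switches to another ready thread
    whenever the active thread is stalled.

    Assumptions (illustrative only): zero-cost switches, round-robin
    choice of the next ready thread, one instruction issued per cycle.
    """
    n = len(threads)
    pc = [0] * n          # next instruction index for each thread
    ready_at = [0] * n    # first cycle at which each thread may issue again
    cycle = 0
    active = 0

    while any(pc[i] < len(threads[i]) for i in range(n)):
        # Prefer the active thread; otherwise scan the others round-robin.
        order = [(active + k) % n for k in range(n)]
        chosen = next(
            (i for i in order
             if pc[i] < len(threads[i]) and ready_at[i] <= cycle),
            None,
        )
        if chosen is None:
            cycle += 1    # every unfinished thread is stalled: idle cycle
            continue
        active = chosen
        stall = threads[chosen][pc[chosen]]
        pc[chosen] += 1
        cycle += 1                          # one issue slot consumed
        ready_at[chosen] = cycle + stall    # thread blocked for `stall` cycles

    # Count any stall still outstanding after the last issue, so the
    # result is comparable with run_serial.
    return max([cycle] + ready_at)
```

With two hypothetical threads that each stall for three cycles on their second instruction, `[[0, 3, 0], [0, 3, 0]]`, this toy model reports 12 cycles for serial execution and 8 cycles with switching, because each thread's stall is partly hidden behind the other thread's instructions. With a single thread, both functions report the same count, since there is nothing to switch to.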

