Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. Previous work has demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized model. In this paper we show that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes. We present an architecture for simultaneous multithreading that achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with the same hardware resources. This speedup is enabled by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads most efficiently using the processor each cycle, thereby providing the "best" instructions to the processor. We examine several heuristics that allow us to identify and use the best threads for fetch and issue, and show that such heuristics can increase throughput by as much as 37%. Using the best fetch and issue alternatives, we then use bottleneck analysis to identify opportunities for further gains on the improved architecture.
To get the PostScript file, click here.