Simultaneous multithreading is a processor
design that combines hardware multithreading with superscalar processor
technology to allow multiple threads to issue instructions each cycle.
Unlike other hardware multithreaded architectures (such as the Tera
MTA), in which only a single hardware context (i.e., thread) is active
on any given cycle, SMT permits all thread contexts to simultaneously
compete for and share processor resources. Unlike conventional
superscalar processors, which suffer from a lack of per-thread
instruction-level parallelism, simultaneous multithreading uses
multiple threads to compensate for low single-thread ILP. The
performance consequence is significantly higher instruction throughput
and program speedups on a variety of workloads that include commercial
databases, web servers and scientific applications in both
multiprogrammed and parallel environments.
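The throughput argument above can be illustrated with a toy issue-slot model. This is only a sketch under stated assumptions, not the project's simulator: the issue width, the per-cycle counts of ready instructions, and the function names are all made up for illustration.

```python
# Toy issue-slot model (illustrative only; all numbers are made up).

def issued_per_cycle(ready_per_thread, issue_width):
    """Slots filled in one cycle when all threads share the issue width."""
    return min(issue_width, sum(ready_per_thread))


def throughput(trace, issue_width):
    """Average instructions issued per cycle.

    trace[c][t] is the number of instructions thread t has ready in cycle c.
    """
    total = sum(issued_per_cycle(cycle, issue_width) for cycle in trace)
    return total / len(trace)


ISSUE_WIDTH = 8  # hypothetical wide-issue machine

# Each thread on its own exposes little instruction-level parallelism.
trace = [
    [2, 3, 1, 2],
    [1, 0, 4, 2],
    [3, 2, 2, 1],
    [0, 1, 3, 3],
]

# Superscalar view: only thread 0 runs, so most issue slots stay empty.
single = throughput([[cycle[0]] for cycle in trace], ISSUE_WIDTH)

# SMT view: all four threads compete for the same slots every cycle.
smt = throughput(trace, ISSUE_WIDTH)

print(f"single thread: {single} instructions/cycle")
print(f"four threads:  {smt} instructions/cycle")
```

In this made-up trace no single thread ever has more than four instructions ready, yet together the threads nearly fill the eight issue slots every cycle.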
Simultaneous multithreading has already had an
impact in both the academic and commercial communities. The project has
produced numerous papers, most of which have been published in journals
or the top, journal-quality architecture conferences, and one of which
was the most recent paper selected for the 25th Anniversary
Anthology of the International Symposium on Computer Architecture,
a competition in which the criterion for acceptance was impact. The SMT
project at the University of Washington has also spawned other
university projects in simultaneous multithreading. Lastly, several
U.S. chip manufacturers (Intel, IBM, Sun, and Compaq, when it still
supported the Alpha microprocessor line) have designed and manufactured
SMT processors for the high-end desktop market. Several startups are
also building SMT processors.
- Simultaneous Multithreading: Maximizing On-Chip Parallelism
Proceedings of the 22nd Annual International Symposium on Computer
Architecture, June 1995.
This paper demonstrated the feasibility of simultaneous multithreading
with simulation-based speedups on several SMT machine models.
It was selected to appear in the 25th Anniversary Anthology of the
International Symposium on Computer Architecture.
- Exploiting Choice: Instruction Fetch and Issue on an
Implementable Simultaneous Multithreading Processor
Joel Emer, Henry Levy,
and Rebecca Stamm
Proceedings of the 23rd Annual International Symposium on
Computer Architecture, May 1996.
We designed SMT's microarchitecture, including a novel instruction
fetch unit that fetched instructions from the two "most profitable"
threads each cycle. In designing the microarchitecture, we met all
three of our original design goals: (1) that SMT exhibit increased
throughputs when executing multiple threads; (2) that SMT not degrade
single-thread performance; and (3) that SMT's implementation be a
straightforward extension of current wide-issue, out-of-order processor
technology. The latter two criteria were necessary to assure a smooth
transition to the commercial world.
This paper was selected for Readings in Computer Architecture,
ed. M.D. Hill, N.P. Jouppi and G.S. Sohi, Morgan Kaufmann, 1999.
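The idea of fetching from the two "most profitable" threads each cycle can be sketched as a small selection routine. The summary above does not say how profitability is measured, so this sketch assumes, purely for illustration, that a thread is more profitable the fewer not-yet-issued instructions it already has queued in the front end. The function name and the per-thread counters are hypothetical.

```python
import heapq


def pick_fetch_threads(queued_counts, num_threads=2):
    """Choose which threads to fetch from this cycle.

    queued_counts maps a thread id to the number of its instructions that
    are still waiting in the front end (an assumed profitability metric).
    Threads with the fewest queued instructions win; ties go to the lower id.
    """
    return heapq.nsmallest(
        num_threads,
        queued_counts,
        key=lambda tid: (queued_counts[tid], tid),
    )


# Hypothetical snapshot of four hardware contexts in one cycle.
snapshot = {0: 12, 1: 3, 2: 7, 3: 3}
chosen = pick_fetch_threads(snapshot)
print(chosen)
```

Under this assumed metric, a thread that is clogging the queues is skipped until its backlog drains, which keeps fetch bandwidth pointed at threads that can use it.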
- Compilation Issues for a Simultaneous Multithreading
Jack Lo, Susan Eggers, Henry Levy, and
Proceedings of the First SUIF Compiler Workshop, January
- Converting Thread-Level Parallelism Into Instruction-Level
Parallelism via Simultaneous Multithreading
Joel Emer, Henry Levy,
Rebecca Stamm, and
ACM Transactions on Computer Systems, August 1997.
Single-chip multiprocessors (CMP, or 2 to 4 superscalar processors on a
single chip) are another emerging processor design that will likely
compete with simultaneous multithreading in the commercial market 3 to
4 years from now. Our experiments indicate that SMT processors can
outperform CMPs when executing coarse-grain parallel programs (the
natural workload for CMPs) by an average of 60%. SMT has the
performance advantage, because it dynamically allocates hardware
resources to whatever threads need them at the time; the CMP, on the
other hand, statically partitions these same hardware resources for all
threads, for all time. This paper carefully quantifies CMP's loss of
performance due to the static partitioning, on a resource-by-resource
basis. In the end, we show that even giving each processor on the CMP
the hardware resources of a single SMT processor can't beat what
dynamic partitioning buys SMT.
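The difference between static and dynamic partitioning described above can be shown with a toy model of one shared resource. This is a sketch under simplifying assumptions, not the methodology of the paper: the resource is a pool of 8 generic slots per cycle, the per-thread demands are invented, and dynamic allocation is idealized as handing every slot to any thread that wants it.

```python
def static_partition(demands, total):
    """CMP-style: every thread owns a fixed, equal share of the resource."""
    share = total // len(demands)
    return sum(min(d, share) for d in demands)


def dynamic_partition(demands, total):
    """SMT-style (idealized): slots go to whichever threads need them."""
    return min(total, sum(demands))


TOTAL_SLOTS = 8  # hypothetical resource pool, per cycle

# Invented per-cycle demands for four threads; demand shifts between threads.
cycles = [
    [6, 1, 0, 1],
    [2, 2, 2, 2],
    [0, 0, 7, 1],
]

static_used = sum(static_partition(c, TOTAL_SLOTS) for c in cycles)
dynamic_used = sum(dynamic_partition(c, TOTAL_SLOTS) for c in cycles)

print(f"static:  {static_used} of {TOTAL_SLOTS * len(cycles)} slots used")
print(f"dynamic: {dynamic_used} of {TOTAL_SLOTS * len(cycles)} slots used")
```

The two schemes tie only in the cycle where demand happens to be evenly spread; whenever one thread needs more than its fixed share, the static partition leaves slots idle. The numbers here are not meant to reproduce the 60% figure quoted above.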
- Simultaneous Multithreading: A Platform for Next-generation
Joel Emer, Henry Levy,
Rebecca Stamm, and
IEEE Micro, September/October 1997.
This paper evaluates SMT with the experience of two years of design and
performance analysis under our belts. It describes the
microarchitecture, including the extended pipeline for accessing the
multi-context register file and the custom instruction fetch unit, and
presents our most current performance studies at the time, including a
comparison to wide-issue superscalars, traditional multithreading, and
chip multiprocessors, executing both a parallel and a multiprogrammed
workload composed of the SPEC95 and SPLASH-2 applications. The paper
was part of the IEEE Computer special series on "how to use a billion
- Tuning Compiler Optimizations for Simultaneous Multithreading
Proceedings of the 30th Annual International Symposium on
Microarchitecture, December 1997, pages 114-124.
Since simultaneous multithreading changes several fundamental
architectural assumptions on which many machine-dependent compiler
optimizations are based, such as the extent to which threads share the
cache hierarchy and the importance of hiding latencies via code
scheduling, it stands to reason that compiler optimizations that rely
on these assumptions may need to be applied differently on an SMT. We
validated this hypothesis for three optimizations that are commonly
used and normally very profitable: loop distribution, loop tiling and
software speculation. We found that when compiling programs for SMT,
the optimizations either had to be coupled with radically different
policies than are currently used, or not used at all.
- An Analysis of Database Workload Performance on Simultaneous
Proceedings of the 25th Annual International Symposium on
Computer Architecture, June 1998.
In addition to the multiprogramming and parallel workloads, we also
studied SMT executing a commercial database workload (Oracle) to gauge
its performance as a database server. Commercial workloads are a
challenge for all computers, because they have extremely poor memory
subsystem performance. We devised operating-system and
application-level mechanisms that improve SMT's performance on
databases by reducing inter-thread conflicts in the memory hierarchy.
The result was a 3-fold improvement over a wide-issue superscalar when
executing Oracle transactions.
- Supporting Fine-Grain Synchronization on a Simultaneous
Proceedings of the 5th International Symposium on High
Performance Computer Architecture, January 1999.
The efficiency of a processor's synchronization mechanism determines
the granularity of parallelism of programs that run on it.
Synchronization on conventional multiprocessors is fairly costly,
because communication among the parallel threads must take place via
memory. Consequently, applications must be parallelized on a fairly
coarse-grain level. Because parallel threads are resident on an SMT
processor, synchronizing them can be done locally (on the processor)
rather than through memory. SMT synchronization is sufficiently
light-weight that it both improves the synchronization performance of
current, coarse-grain parallel programs and permits fine-grain
parallelization of new codes that cannot be parallelized with current
- Software-Directed Register Deallocation for Simultaneous
IEEE Transactions on Parallel and Distributed Systems,
We designed operating-system and compiler-directed architectural
techniques to deallocate SMT registers earlier than can be done with
current register renaming hardware. The mechanisms free idle hardware
contexts (those whose thread has terminated) and registers in active
contexts after their last use. The performance consequence is either a reduction
in register file size (useful if the register file access determines
the processor cycle time) or an increase in performance for a given
file size. The compiler-based techniques have much wider applicability
than SMT processors -- they also improve performance on any
- Characterizing Processor Architectures for Programmable
Marc E. Fiuczynski
Proceedings of the 2000 International Conference on
This paper characterizes current and future network processor
workloads and concludes that SMT is better suited to a network node
than aggressive out-of-order superscalars, fine-grain multithreaded
processors, and chip multiprocessors (CMPs).
- An Analysis of Operating System Behavior on a Simultaneous
Proceedings of the 9th International Conference on Architectural
Support for Programming Languages and Operating Systems,
This paper presents our first analysis of operating system execution
on a simultaneous multithreaded processor. To carry out this study, we
modified the Digital 4.0 Unix operating system to run on a simulated
SMT CPU based on a Compaq Alpha processor. We executed this environment
by integrating our SMT Alpha instruction set simulator into the SimOS
machine simulator. As our principal workload, we executed the Apache
Web server running on an 8-context SMT under Digital Unix. Our results
demonstrate the micro-architectural impact of an OS-intensive workload
on a simultaneous multithreaded processor, and provide insight into the
OS demands of the OS-intensive Apache Web server.
- Thread-Sensitive Scheduling for SMT Processors
University of Washington Technical Report 2000-04-02.
This paper examines thread-sensitive scheduling for SMT processors.
When more threads exist than hardware execution contexts, the operating
system is responsible for selecting which threads to execute at any
instant, inherently deciding which threads will compete for resources.
Thread-sensitive scheduling uses thread-behavior feedback to choose the
best set of threads to execute together, in order to maximize processor
throughput. We introduce several thread-sensitive scheduling schemes
and compare them to traditional oblivious schemes, such as round-robin.
Our measurements show how these scheduling algorithms impact
performance and the utilization of low-level hardware resources. We
also demonstrate how thread-sensitive scheduling algorithms can be
tuned to trade off performance against fairness. For the workloads we
measured, we show that an IPC-based thread-sensitive scheduling
algorithm can achieve speedups over oblivious schemes of 7% to 15%,
with minimal hardware costs.
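The contrast between an oblivious scheduler and an IPC-based, thread-sensitive one can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm from the report: the function names, the table of observed IPC per co-scheduled group, and the rule of trying unmeasured groups first are all hypothetical.

```python
from itertools import combinations


def round_robin(threads, contexts, quantum):
    """Oblivious baseline: rotate through the threads, ignoring behavior."""
    start = (quantum * contexts) % len(threads)
    picked = [threads[(start + i) % len(threads)] for i in range(contexts)]
    return tuple(sorted(picked))


def ipc_sensitive(threads, contexts, observed_ipc):
    """Pick the group of threads with the best IPC seen so far.

    observed_ipc maps a sorted tuple of thread names to the IPC measured
    the last time that group ran together (the feedback). Groups that have
    never been measured are tried first so that they get a measurement.
    """
    candidates = list(combinations(sorted(threads), contexts))
    for group in candidates:
        if group not in observed_ipc:
            return group
    return max(candidates, key=lambda group: observed_ipc[group])


# Three runnable threads, two hardware contexts, invented IPC feedback.
threads = ["a", "b", "c"]
feedback = {("a", "b"): 2.1, ("a", "c"): 3.4}

first = ipc_sensitive(threads, 2, feedback)   # ("b", "c") is still unmeasured
feedback[first] = 2.8                         # pretend we ran it and measured
second = ipc_sensitive(threads, 2, feedback)  # now the best known group wins

print(first, second, round_robin(threads, 2, quantum=1))
```

A purely greedy choice like this one would starve threads that never appear in the best group, which is why a real policy would need some way to balance throughput against fairness; this sketch does not model that.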
- Mini-threads: Increasing TLP on Small-Scale SMT Processors
Proceedings of the International Conference on High-Performance
Computer Architecture, February 2003.
This paper presents the mini-thread architectural model for increasing
thread-level parallelism on SMT processors, particularly small-scale
implementations which, because of their size, can be thread-starved. It
also empirically demonstrates the performance trade-off for one
implementation of mini-threads, that in which the architectural
register file is partitioned among all executing threads in a hardware
context, showing that the benefits of the additional TLP far outweigh
the cost of additional spill code generated because each thread has
fewer architectural registers available to it.
- Improving Server Software Support for Simultaneous
Symposium on Principles and Practice of Parallel Programming,
This paper evaluates how SMT's hardware affects traditional support for
server software, in particular, memory allocation and synchronization,
for three different server models. The results demonstrate how a few
simple changes to the run-time libraries can dramatically boost
multi-threaded server performance on SMT, without requiring
modifications to the applications themselves.
- An Evaluation of Speculative Instruction Execution on
Transactions on Computer Systems,
- Exploiting Thread-Level Parallelism on Simultaneous
Jack Lo's thesis, 1998.
- An Analysis of Software Interface Issues for SMT Processors
Josh Redstone's thesis, 2002
This page maintained by Susan Eggers
eggers [at] cs [dot] washington [dot] edu