Simultaneous Multithreading Project

Currently under reconstruction


Research Topics:

We have just obtained new funding to pursue the research topics listed below. We are, of course, keen to hear about other topics you might be interested in. Depending upon your stage of career, they could result in quals credit, a masters degree, dissertation research, undergraduate project credit or just some experience in experimental architecture and operating systems research. If the project is too big for what you have in mind, we can probably whittle it down.

If one of them interests you, just stop by to talk.



Research Objectives

While many SMT architectural issues have been resolved, one major area has not been explored at all: the design and performance of operating systems for multithreaded processors (SMT included). With one lone exception, Josh Redstone's initial study of operating system behavior on SPEC and web service workloads (which will appear in this year's ASPLOS), all SMT research has centered on user-level applications executing on user-level instruction-set simulators. Those few operating system functions that were examined (for example, page mapping) were modeled and emulated at a very high level; most fundamental operating system tasks, such as interrupt handling, I/O, networking, and scheduling, were not modeled at all. Now that we have the infrastructure to simulate the operating system in detail (i.e., we have ported SMT to the SimOS simulator), we are poised to begin operating-systems-oriented research in earnest.

Despite numerous hardware advances of the last several decades, the basic structure of operating systems has remained unchanged. Our goal is to reexamine operating systems design to capitalize on several hardware capabilities that are unique to SMT and to meet the demands of today's request-based parallel workloads, such as databases and web servers, that rely heavily on operating systems support. We will focus on the interaction of operating systems activity and several of SMT's features that are the source of its current performance advantage: SMT's different execution model (relative to today's superscalars and shared memory multiprocessors), its hardware support for lightweight synchronization, the fine-granularity with which its threads dynamically share most hardware resources, and its processor-resident thread state.

Why SMT Could Change the Operating System

Because an operating system is the lowest software layer that executes on and interfaces directly with hardware, it is closely tied to the design and organization of the processor. Within the OS, key software strategies and policies are designed to take advantage of, or to compensate for, particular architecture features. Since most commercial server-based operating systems execute on small multiprocessors (shared-memory machines, distributed-memory machines, or both), they have been designed with these underlying architectures in mind. Simultaneous multithreading changes many of these architectural assumptions. For example, in multiprocessor environments, threads execute on different processors, and each thread has its own private copy of many hardware data structures, such as the L1 caches and TLBs. Any data sharing involves expensive interprocessor communication, and consequently operating system (and compiler) strategies are designed to discourage or eliminate such sharing. On an SMT processor, on the other hand, the environment is largely the opposite of that provided by shared-memory machines. All SMT threads execute on the same processor, and most hardware, including the L1 cache and TLB, is dynamically shared by the threads on a cycle-by-cycle basis. Using this hardware involves only processor-local accesses; consequently, sharing (and even false sharing) of the resources among threads is cheap and should be encouraged.

A second difference between multiprocessors and SMT is the granularity of thread synchronization. Multiprocessors must synchronize through the closest (to the processor) memory that the individual processors share, at best the L2 caches, but often main memory. The long latencies of these memories dictate the high overhead of multiprocessor synchronization operations and consequently the coarse granularity of parallelism. Since threads execute on the same processor in an SMT, thread synchronization can take place within the processor, rather than through memory. Consequently, it is one or two orders of magnitude faster than traditional shared-memory synchronization. We have demonstrated the potential of SMT's feather-weight synchronization to exploit fine-grained parallelism in applications. For the research in this proposal, we wish to explore the use of fine-grained synchronization to improve parallelism, and hence performance, within the operating system. We believe that SMT is a particularly promising architecture for a high-throughput message processor, and we will explore the potential for increased parallelism in the OS networking/messaging code. We will also target the disk and file system, particularly on large servers handling multiple disks. Overall, our research objective is to build a truly parallel operating system, rather than a (conventional) operating system that simply supports parallel application programs.
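The granularity tradeoff above can be made concrete with a back-of-the-envelope model (all cycle counts below are illustrative assumptions, not measurements of any real machine): a task is worth splitting across threads only when the per-thread work outweighs the synchronization overhead paid to split it, so cheaper synchronization lowers the minimum profitable task size.

```python
# Toy speedup model: split a task across threads, paying a fixed
# synchronization cost. All cycle counts are invented for illustration.

def parallel_speedup(work_cycles, threads, sync_cycles):
    """Serial time / parallel time for an evenly split task."""
    serial = work_cycles
    parallel = work_cycles / threads + sync_cycles
    return serial / parallel

# Memory-based MP synchronization (hundreds of cycles) vs. SMT's
# in-processor primitives (a few cycles), on a small 400-cycle task
# split four ways:
mp = parallel_speedup(400, 4, 200)    # 400 / (100 + 200): barely worth it
smt = parallel_speedup(400, 4, 4)     # 400 / (100 + 4): near-linear speedup
```

With coarse, memory-based synchronization, this toy task barely benefits from parallelization; with in-processor synchronization it approaches the ideal 4x, which is why fine-grained decomposition of OS work becomes attractive on SMT.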

In summary, we will explore the interactions between operating system design and multithreading architecture, from two points of view. From one side, we will examine how a multithreaded processor can best support operating systems needs; and from the other side, how an operating system can best take advantage of the unique characteristics of a multithreaded architecture.

Research in Detail

Below is a brief description of a few of the specific activities we have in mind. New entries are marked New!

  • Our most basic research will study the impact of operating systems code on simultaneous multithreaded execution. We will determine where the operating system spends its time, to identify the bottlenecks of an OS executing on an SMT processor. We hope to discover particular places where current operating system design is inappropriate for multithreaded technology. This work will focus on several different aspects of OS/SMT interaction, beginning with targets that previous studies have shown are degraded by operating system activity (relative to user-level code), e.g., branch prediction hardware, caches, and TLBs. It is important to verify that the operating system does not erode the high throughput improvements that user-mode SMT simulations have demonstrated in the past. If the OS does have a negative impact on throughput, we will investigate the causes and address them with changes to either the SMT microarchitecture or operating system policies. Josh Redstone

  • Interthread conflicts. One side effect of SMT's multithreaded execution is an increase in miss rates in all hardware data structures, such as the cache hierarchy, the TLBs, and the branch prediction logic. The additional misses occur because these structures hold the working sets of multiple threads, rather than just one. For example, SMT doubles the L1 cache miss rates of the SPEC workload. We need to determine where interthread conflicts exist in privileged operating system code and devise hardware or software techniques to reduce them. Interthread conflicts are an important source of performance degradation: conflicts of this nature hobbled SMT database performance until we changed the page mapping strategy.
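A toy simulation can show the flavor of the page-mapping effect described above (the cache geometry, page numbers, and access pattern here are invented for illustration, not the parameters of our actual study): two threads whose pages receive the same cache color keep evicting each other's lines from a shared direct-mapped cache, while mapping their pages to different colors makes the interthread conflicts disappear.

```python
# Toy direct-mapped cache shared by two interleaved threads.
# 64-byte lines, 128 sets (8 KB), 4 KB pages => 2 page colors.
LINE, SETS, PAGE = 64, 128, 4096
LINES_PER_PAGE = PAGE // LINE            # 64 lines per page

def misses(thread_pages, passes=2):
    """Interleave one access per thread over their pages; count misses."""
    cache = {}                           # set index -> resident block address
    n = 0
    for _ in range(passes):
        for off in range(LINES_PER_PAGE):
            for ppage in thread_pages:   # one physical page per thread
                block = ppage * PAGE + off * LINE
                s = (block // LINE) % SETS
                if cache.get(s) != block:
                    n += 1
                    cache[s] = block     # evict whatever was in the set
    return n

# Same color: pages 0 and 2 both map to sets 0..63, so even the second
# pass misses everywhere -- the threads thrash each other.
same_color = misses([0, 2])
# Different colors: pages 0 and 1 use disjoint sets (0..63 vs. 64..127),
# so only the cold first pass misses.
diff_color = misses([0, 1])
```

The page mapper cannot shrink each thread's working set, but by spreading the threads' pages across colors it can keep one thread's misses from becoming another thread's evictions.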

  • Exploiting lightweight parallelism within an operating system. While operating systems are currently multithreaded, the multithreading is rather coarse-grain. In general, each OS thread represents an independent task that the operating system is concurrently processing, such as an I/O request. Using SMT's lightweight thread creation and featherweight synchronization primitives, we will see where decomposing the coarse-grain parallelism into finer-grain tasks will improve OS (throughput) performance. This may involve significant rethinking of conventional operating systems structures and algorithms, since it involves not merely dicing up operating systems activities into smaller pieces, but constitutes a shift in the purpose of multithreaded operating systems from managing concurrent, independent threads to improving performance through parallelism. In this work we will initially focus on the network and disk subsystems.

  • New! Renaming registers. Our previous work has shown that with compiler- and operating-systems-directed support, SMT can deallocate renaming registers earlier than when just relying on the hardware alone. The result was a reduction in the size of the register file needed to achieve current SMT performance. This could be crucial if we are to maintain register file access within an acceptable cycle time. However, the work was only done on an applications-based workload. We'd like to incorporate support for early renaming register deallocation into the SimOS SMT simulator and gauge its effect when executing the operating system and OS-intensive applications.

  • An SMT web server. Web service is a particularly interesting application for SMT, because it is highly multithreaded and makes extensive use of the operating system for both network and file system operations. In addition, web servers have been developed with different underlying goals, and therefore have different implementations. As an example, Apache was designed for portability and Flash for performance. There are several issues here. The first addresses how well SMT performs with each type of Web server software. Another explores possible architectural or microarchitectural changes to address any performance problems we find with supporting Web service on an SMT. A third compares the structure and performance of both web server designs to each other and also across a few key architectures, namely, single-threaded uniprocessors and SMT processors. The Apache Web server has already been installed and experimented with on our SMT/SimOS base; Flash is still to be done. Dimitriy Portnov

  • An MP of SMTs. A shared-memory multiprocessor, where each of the CPUs is an SMT processor, will enable SMT to scale beyond the number of hardware contexts that can be supported on a single chip. This leads to new issues of task decomposition, thread scheduling, resource allocation, hybrid synchronization and coherency schemes, etc. Our hardware and software solutions will need to be particularly creative in this environment, because MPs and SMT often have opposing strengths. For example, sharing and interprocessor communication on an MP are costly, whereas sharing and interthread communication on an SMT are extraordinarily cheap. Whereas our bias has always been to build the highest performance SMT, in this environment we might also consider a CMP (chip multiprocessor) of small-scale SMTs, which may reduce hardware design time and/or power consumption. Since simulating an MP of SMTs will be computationally expensive, this project will require some innovative simulation design.

  • Other platforms. Multithreading might be useful in domains other than high-end servers. We'd like to examine a diverse collection of dedicated/embedded environments, such as network interface cards, signal processing applications and robots, to gauge their appropriateness for SMT techniques. For example, multithreading is a natural technique to apply in the CPU for an interface card that handles high-throughput messaging or routing. In this case, the thread contexts provide low-latency response to message arrival and a hardware-based mechanism for message multiplexing. Such embedded applications require highly efficient, application-specific operating systems with a very low-overhead design. We will also investigate the appropriateness of active network software in SMT processors and examine real-time issues in signal processing applications. This work will involve reexamining the design of embedded processor microarchitectures.

  • Dedicated hardware contexts. We want to examine whether it is possible to effectively reserve hardware contexts for specific, commonly occurring functions that need to interact with an executing main program. Candidates include:
  • dynamic compilers that execute as co-routines with the application they are dynamically specializing and then interrupt the application to begin executing the code they have just dynamically generated;
  • profilers that gather information that can be used to optimize programs as they execute, e.g., to re-layout instructions and data or specialize code to take advantage of value locality;
  • garbage collectors;
  • helper functions that compute different outputs for signal processing applications that meet tighter real-time constraints.

    There could be lots of examples here -- use your imagination! Which functions should be chosen and why? How many? What performance benefit do they reap? Are special hardware or software techniques necessary to support them?

  • Debugging on an SMT. We expect that a significant challenge for SMT-based systems will be debugging the heavily multithreaded applications and operating systems. SMT's potential for extremely fine-grained synchronization will likely exacerbate the existing difficulty of multithreaded programming and debugging. The main issue is what the SMT debugging infrastructure should look like. It may be possible to influence the design of the SMT scheduler to simplify debugging. One option would be to adopt the dedicated hardware context model mentioned above and execute a debugging thread in a separate hardware context that monitors application threads, reporting on their progress or triggering breakpoints on certain conditions.

  • OS functions in user space. A trend in high-performance servers and I/O is towards event-driven (or upcall-driven) architectures. SMT can potentially improve on this trend, by dedicating an SMT thread to handle an interrupt; the dedicated thread could push an event that results from a hardware interrupt upwards through the operating system, and perhaps even into user-level code, causing application-level processing to occur. This design looks similar to active messages, but with a dedicated SMT thread as the active context. Because of the ability to simultaneously execute multiple SMT threads, however, the limitations of current active message handlers (little functionality because of the low-latency, non-blocking requirements) could be relieved, vastly simplifying the task of using them. The extension of this idea would be to design and build an upcall-oriented OS, in which SMT threads emerge out of the operating system to dispatch events to applications. With such an operating system, applications would be structured purely as event handlers or upcall handlers, hopefully resulting in much higher performance.
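A software analogue of the upcall structure just described can be sketched as follows (the class and method names are invented for illustration; a real implementation would live in the kernel with a hardware context in place of the thread): a dedicated thread stands in for the SMT context reserved for interrupt handling, pushing each event straight up to an application-registered handler instead of returning through the usual kernel paths.

```python
import queue
import threading

class UpcallDispatcher:
    """Toy model of an upcall-oriented event path: a dedicated thread
    (standing in for a reserved SMT hardware context) delivers each
    event directly to an application-level handler."""

    def __init__(self):
        self.handlers = {}
        self.events = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def register(self, kind, fn):
        # The application registers an upcall handler for an event kind.
        self.handlers[kind] = fn

    def raise_event(self, kind, payload):
        # Stands in for a hardware interrupt arriving.
        self.events.put((kind, payload))

    def _run(self):
        while True:
            kind, payload = self.events.get()
            self.handlers[kind](payload)   # the "upcall" into application code
            self.events.task_done()

    def drain(self):
        # Wait until all pending events have been dispatched.
        self.events.join()
```

Because the dispatching thread can block or do substantial work without stalling the rest of the processor, it escapes the low-latency, non-blocking restrictions that limit conventional active message handlers.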

  • New! Hardware assists. SMT produces better instruction throughput than other processors despite not yet having some hardware assists that are standard in most microarchitectures, such as a victim cache. Some SMT-specific considerations may apply here. For example, is the size of the victim cache a function of the number of threads, the size of the L1 data cache, or something else? How does SMT victim cache performance compare to that on a superscalar? How does its optimal size compare to that needed by a superscalar?
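For intuition, here is a toy software model of the victim-cache idea (the sizes and access pattern are invented, and it does not answer the sizing questions posed above): a tiny fully associative buffer catches lines evicted by conflicts from a direct-mapped cache, turning a ping-ponging pair of addresses from all misses into mostly hits.

```python
from collections import OrderedDict

class DirectMappedWithVictim:
    """Toy direct-mapped cache backed by a small fully associative
    victim cache that catches conflict evictions (illustrative only)."""

    def __init__(self, sets, victim_entries):
        self.sets = sets
        self.lines = {}                 # set index -> resident block address
        self.victim = OrderedDict()     # block address -> None, in LRU order
        self.victim_entries = victim_entries
        self.hits = self.misses = 0

    def access(self, block):
        s = block % self.sets
        if self.lines.get(s) == block:          # L1 hit
            self.hits += 1
            return
        if block in self.victim:                # victim hit: swap back into L1
            self.victim.pop(block)
            self.hits += 1
        else:                                   # true miss
            self.misses += 1
        evicted = self.lines.get(s)
        self.lines[s] = block
        if evicted is not None:                 # conflict victim goes to buffer
            self.victim[evicted] = None
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False) # drop the oldest victim
```

On SMT, the open question above is whether the victim buffer must scale with the number of threads ping-ponging the same sets, or with the L1 geometry itself.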

  • New! Cycle-time bottlenecks. A controversial and discussion-provoking paper from UT Austin that appeared in the 2000 International Symposium on Computer Architecture forecast a gloomy future for centralized processor architectures, such as SMT. The paper projected poor wire scaling (i.e., big-time wire delays) as process technologies shrink, which would (theoretically) limit the performance improvement of SMT relative to more decentralized architectures, such as CMP. With this in mind, we think it's time to remeasure the time needed to access SMT's hardware data structures, such as the register file, the integer and floating point instruction queues, all caches & TLBs, the active list, the branch prediction tables and even main memory, to ensure that they can be accessed in the number of cycles we have currently allotted. Fortunately, there is a publicly available tool to do the measurements (ECacti). But there are still some low-level parameters we need to nail down, such as the appropriate clock scaling target, clock skew and pipeline latch time. This would be a good project for someone with EE in their background or who would like to learn about design issues at a lower level than the microarchitecture. The goal for this project would be: given the process technology forecast for about three years from now, design the most aggressive SMT architecture whose hardware structures can be accessed within a single cycle (except for the caches and memory, of course). The following three projects are related to this.

  • Clusters of functional units. SMT is based on a wide-issue superscalar design (the main reason its commercial transition should be easy), and most of its hardware resources are shared by all threads (one of the reasons for its high instruction throughput). The flip side of these advantages is that they require long wires between various hardware resources, such as from the register files to the functional units, and wire delay contributes directly to cycle time. To reduce cycle time, we should investigate different ways to cluster the functional units and other hardware so as to reduce wire length. A good starting point would be the work on clustered architectures by Farkas, Chow, Jouppi and Vranesic (MICRO-30, 1997).

  • Other critical paths. In addition to, or instead of, a clustered SMT implementation, we should identify the potential critical paths in an SMT processor (possible candidates are register file access, register renaming, hardware dependence checking and L1 cache access), and devise ways of shortening them with either clever layout or scaling back the complexity of the design. If the latter, we will want to ascertain the performance hit SMT takes as a result. A good starting point here is the critical path analysis and the microarchitecture designs that reduce superscalar critical paths by Palacharla, Jouppi and Smith (ISCA 97). In the initial stages of SMT research, we investigated thread-private L1 caches, but scrapped them in favor of shared caches because of lower performance. Maybe now's the time to reconsider, in light of the decreasing cycle times brought by denser process technologies. Also see Renaming registers above.

  • Bypass logic. Gad Sheaffer at Intel has suggested that because of its latency-hiding ability, an SMT does not need bypass logic. If so, this would obviate concerns about the slowness of our bypassing, which, because of the longer SMT pipeline, comprises several levels. We should check this out.

  • Better mechanisms for choosing the thread(s) to fetch. Glenn Hinton at Intel has suggested that some of the thread fetching policies we rejected in the ISCA 1996 paper might obtain better performance if coupled with more aggressive hardware. For example, we could revisit MISSCOUNT only for long latency misses and BRCOUNT when coupled with branch prediction hardware that evaluates the confidence of the prediction. A place to begin the latter is with Jim Smith's branch confidence paper.

  • Hardware performance monitoring for better SMT performance. Currently the instruction queue keeps track of the number of instructions it holds per thread, so that the instruction fetcher can choose the best thread to fetch from next (the one making the best progress through the queue, i.e., the one with the fewest instructions in the queue). What other feedback mechanisms would be useful?
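The queue-occupancy heuristic just described (the ICOUNT idea from the ISCA 1996 fetch-policy work) is simple enough to sketch; the thread ids and counts below are made up for illustration:

```python
def pick_fetch_thread(iq_counts):
    """Fetch next from the thread with the fewest instructions in the
    instruction queue -- the one making the best progress through it.
    iq_counts maps thread id -> instructions currently queued; ties go
    to the lower thread id for determinism."""
    return min(iq_counts, key=lambda tid: (iq_counts[tid], tid))

# Thread 1 has only 3 instructions queued, so it is fetched next:
choice = pick_fetch_thread({0: 12, 1: 3, 2: 7})
```

Other feedback mechanisms would slot in by changing the key function, e.g., weighting in outstanding long-latency misses or low-confidence branches, as suggested in the fetch-policy item above.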

  • The guts of operating systems and simulators. Much of this research will involve augmenting the infrastructure we have built to experiment with and measure real operating systems in an SMT environment. This infrastructure is based on the SimOS simulation system. SimOS is an instruction set simulator that is detailed enough to boot and execute an entire operating system, i.e., it simulates all privileged state, including I/O, interrupts, and exception handling. For the Compaq Alpha, SimOS simulates Tru64 UNIX, a high-calibre 64-bit operating system, which is optimized for many modern tasks, including network handling. Compaq SimOS also simulates Alpha PALcode, a type of "microcode" in which the lowest level operating system support services, such as TLB miss handling, execute. Exposure to this software environment will offer valuable experience in understanding large-scale simulators and modern operating systems.

    This page maintained by Susan Eggers
    eggers@cs.washington.edu