Currently under reconstruction
Research Topics:
We have just obtained new funding to pursue the research topics
listed below. We are, of course, keen to hear about other topics you might
be interested in. Depending upon your stage of career, they could result
in quals credit, a master's degree, dissertation research,
undergraduate project credit or just some experience in experimental
architecture and operating systems research. If the project is too big
for what you have in mind, we can probably whittle it down.
If one of them interests you, just stop by to talk.
Last modified on
Research Objectives
While many SMT architectural issues have been resolved, there is one major
area that has scarcely been explored: the design and performance of
operating systems for multithreaded processors (SMT included).
With one lone exception, Josh Redstone's initial study examining operating
system behavior on SPEC and web-service workloads
(which will appear in this year's ASPLOS),
all SMT research has centered on user-level applications executing on
user-level instruction-set simulators.
The few operating system functions that were
examined (for example, page mapping) were modeled and emulated at a
very high level; most fundamental operating system tasks, such as
interrupt handling, I/O, networking, and scheduling, were not modeled
at all.
Now that we have the infrastructure to simulate the operating system in
detail (i.e., we have ported SMT to the SimOS simulator),
we are poised to begin operating systems-oriented research in earnest.
Despite numerous hardware advances of the last several decades,
the basic structure of operating systems has remained unchanged.
Our goal is to reexamine operating systems design to capitalize on
several hardware capabilities that are unique to SMT and
to meet the demands of today's request-based parallel workloads,
such as databases and web servers, that rely heavily on operating systems
support.
We will focus on the interaction of operating systems activity and several
of SMT's features that are the source of its current performance advantage:
SMT's different execution model
(relative to today's superscalars and shared memory multiprocessors),
its hardware support for lightweight synchronization,
the fine-granularity with which its threads dynamically share most hardware
resources,
and its processor-resident thread state.
Why SMT Could Change the Operating System
Because an operating system is the lowest software layer that executes on
and interfaces directly with hardware, it is closely tied to the design and
organization of the processor.
Within the OS, key software strategies and policies are designed to take
advantage of or to compensate for particular architecture features.
Since most commercial server-based operating systems execute on small
multiprocessors
(shared-memory machines, distributed-memory machines, or hybrids of the two),
they have been designed with these underlying architectures in mind.
Simultaneous Multithreading changes many of these architectural assumptions.
For example, in multiprocessor environments,
threads execute on different processors, and each thread has its own
private copy of many hardware data structures, such as the L1 caches and TLBs.
Any data sharing involves expensive interprocessor communication,
and consequently operating system (and compiler) strategies are designed
to discourage or eliminate such sharing. On an SMT processor,
on the other hand, the environment is largely the opposite of that
provided by shared-memory machines. All SMT threads execute on the same
processor, and most hardware, including the L1 cache and TLB,
is dynamically shared by the threads on a cycle-by-cycle basis.
Using this hardware involves processor-local accesses; consequently,
sharing (and even false sharing) of the resources among threads is cheap
and should be encouraged.
A second difference between multiprocessors and SMT is the granularity of
thread synchronization.
Multiprocessors must synchronize through the closest (to the processor)
memory that the individual processors share, at best the L2 caches,
but often main memory.
The long latencies of these memories dictate the high overhead of
multiprocessor synchronization operations and consequently the coarse
granularity of parallelism.
Since threads execute on the same processor in an SMT,
thread synchronization can take place within the processor,
rather than through memory.
Consequently, it is one or two orders of magnitude faster than traditional
shared-memory synchronization.
We have demonstrated the potential of SMT's featherweight synchronization
to exploit fine-grained parallelism in applications.
For the research in this proposal, we wish to explore the use of
fine-grained synchronization to improve parallelism and, hence, performance,
within the operating system.
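The effect of this latency gap on profitable task granularity can be made
concrete with a back-of-envelope model. The cycle counts below are our own
illustrative assumptions, not measured SMT or multiprocessor numbers:

```python
# Back-of-envelope model (illustrative only): how synchronization cost
# limits how finely work can profitably be partitioned. The cycle
# counts are assumptions, not measurements.

def efficiency(work_cycles, sync_cycles):
    """Fraction of time spent on useful work for a task of
    `work_cycles` that pays `sync_cycles` of synchronization overhead."""
    return work_cycles / (work_cycles + sync_cycles)

MEMORY_SYNC = 200      # assumed: lock traffic through L2 or main memory
IN_PROCESSOR_SYNC = 2  # assumed: SMT in-processor synchronization

for grain in (100, 1_000, 10_000):
    print(f"grain={grain:>6} cycles: "
          f"MP efficiency={efficiency(grain, MEMORY_SYNC):.2f}, "
          f"SMT efficiency={efficiency(grain, IN_PROCESSOR_SYNC):.2f}")
```

Under these assumptions, 100-cycle tasks waste two-thirds of their time on
memory-based synchronization but almost nothing on in-processor
synchronization, which is why fine-grained OS parallelism only becomes
attractive on SMT.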
We believe that SMT is a promising architecture for a high-throughput
message processor in particular, and we will explore the potential for
increased parallelism in the OS networking/messaging code.
We will also target the disk and file system, particularly on large
servers handling multiple disks.
Overall, our research objective is to build a truly parallel operating
system,
rather than a (conventional) operating system that simply supports parallel
application programs.
In summary, we will explore the interactions between operating system design
and multithreading architecture, from two points of view.
From one side, we will examine how a multithreaded processor can best support
operating systems needs;
and from the other side, how an operating system can best take advantage of
the unique characteristics of a multithreaded architecture.
Research in Detail
Below is a brief description of a few of the specific activities we have
in mind.
Our most basic research will study the impact of operating systems code on
simultaneous multithreaded execution.
We will determine where the operating system spends its time, to identify
the bottlenecks of an OS executing on an SMT processor.
We hope to discover particular places where current operating system design
is inappropriate for multithreaded technology.
This work will focus on several different aspects of OS/SMT interaction.
We will begin with targets that previous studies have shown are degraded
by operating system activity (relative to user-level code),
e.g., the branch prediction hardware, caches, and TLBs.
It is important to verify that the operating system does not erode
the large throughput improvements that the user-mode SMT simulations
have demonstrated in the past.
However, if the OS does have a negative impact on throughput,
we will investigate and understand the causes and address them with either
changes to the SMT microarchitecture or operating systems policies.
Josh Redstone
Interthread conflicts.
One side effect of SMT's multithreaded execution is an increase in miss
rates in all hardware data structures,
such as the cache hierarchy, the TLBs, and the branch prediction logic.
The additional misses occur because these structures hold the contents of
multiple working sets, rather than just one.
For example, SMT doubles L1 cache miss rates of the SPEC98 workload.
We need to determine where interthread conflicts exist in privileged
operating systems code and devise hardware or software techniques to
reduce them.
Interthread conflicts are an important source of performance degradation
to eliminate.
Conflicts of this nature hobbled SMT database performance until we instituted
a change in page mapping strategy.
Exploiting lightweight parallelism within an operating system.
While operating systems are currently multithreaded,
the multithreading is rather coarse-grain.
In general, each OS thread represents an independent task that the operating
system is concurrently processing, such as an I/O request.
Using SMT's lightweight thread creation and featherweight synchronization
primitives,
we will see where decomposing the coarse-grain parallelism into finer-grain
tasks will improve OS (throughput) performance.
This may involve significant rethinking of conventional operating systems
structures and algorithms,
since it involves not merely dicing up operating systems activities into
smaller pieces,
but constitutes a shift in the purpose of multithreaded operating systems
from managing concurrent, independent threads to improving performance through
parallelism.
In this work we will initially focus on the network and disk subsystems.
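The structure of such a decomposition can be sketched in miniature. The
example below splits one coarse-grained task, checksumming a network buffer,
into fine-grained subtasks that cooperate on a single request; on an SMT the
workers would be hardware contexts joined by featherweight synchronization,
and Python threads here merely stand in for that structure (the buffer,
chunking, and worker count are all invented for illustration):

```python
# Illustrative sketch (not real OS code): decompose one request into
# fine-grained subtasks. On an SMT the fork/join would use cheap
# in-processor synchronization; ThreadPoolExecutor stands in for it.
from concurrent.futures import ThreadPoolExecutor

def checksum_range(buf, lo, hi):
    # Partial 16-bit checksum over one chunk of the buffer.
    return sum(buf[lo:hi]) & 0xFFFF

def parallel_checksum(buf, n_workers=4):
    # Carve the buffer into n_workers contiguous chunks...
    step = (len(buf) + n_workers - 1) // n_workers
    ranges = [(lo, min(lo + step, len(buf))) for lo in range(0, len(buf), step)]
    # ...compute partial sums concurrently, then join (the "barrier").
    with ThreadPoolExecutor(n_workers) as pool:
        partials = pool.map(lambda r: checksum_range(buf, *r), ranges)
    return sum(partials) & 0xFFFF

buf = bytes(range(256)) * 16
print(hex(parallel_checksum(buf)))  # -> 0xf800, same as the serial checksum
```

The point is not the checksum itself but the shape of the code: the join at
the end is only worthwhile when synchronization is as cheap as SMT makes it.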
Renaming registers.
Our previous work has shown that with compiler- and operating-system-directed
support, SMT can deallocate renaming registers earlier than the hardware
alone can.
The result was a reduction in the size of the register file needed to achieve
current SMT performance.
This could be crucial if we are to maintain register file access within an
acceptable cycle time.
However, the work was only done on an applications-based workload.
We'd like to incorporate support for early renaming register deallocation into
the SimOS SMT simulator and gauge its effect when executing the operating system
and OS-intensive applications.
An SMT web server.
Web service is a particularly interesting application for SMT,
because it is highly multithreaded and makes extensive use of the operating
system for both network and file system operations.
In addition, web servers have been developed with different underlying goals,
and therefore have different implementations.
As an example, Apache was designed for portability and Flash for performance.
There are several issues here.
The first addresses how well SMT performs with each type of Web server
software.
Another explores possible architectural or microarchitectural changes
to address
any performance problems we find with supporting Web service on an SMT.
A third compares the structure and performance of the two web server designs
to each other and across a few key architectures,
namely, single-threaded uniprocessors and SMT processors.
The Apache Web server has already been installed and experimented with
on our SMT/SimOS base;
Flash is still to be done.
Dimitriy Portnov
An MP of SMTs.
A shared-memory multiprocessor,
where each of the CPUs is an SMT processor,
will enable SMT to scale beyond the number of hardware contexts that can be
supported on a single chip.
This leads to new issues of task decomposition, thread scheduling, resource
allocation, hybrid synchronization and coherency schemes, etc.
Our hardware and software solutions will need to be particularly creative in
this environment,
because MPs and SMT often have opposing strengths.
For example, sharing and interprocessor communication on an MP is costly,
whereas sharing and interthread communication on an SMT is extraordinarily
cheap.
Whereas our bias has always been to build the highest performance SMT,
in this environment we might also consider a CMP (chip multiprocessor) of
small-scale SMTs,
which may reduce hardware design time and/or power consumption.
Since simulating an MP of SMTs will be computationally expensive,
this project will require some innovative simulation design.
Other platforms.
Multithreading might be useful in domains other than high-end servers.
We'd like to examine a diverse collection of dedicated/embedded
environments,
such as network interface cards, signal processing applications and robots,
to gauge the appropriateness of SMT techniques in each.
For example, multithreading is a natural technique to apply in the CPU for
an interface card that handles high-throughput messaging or routing.
In this case, the thread contexts provide low-latency response to message
arrival and a hardware-based mechanism for message multiplexing.
Such embedded applications require highly efficient,
application-specific operating systems with a very low-overhead design.
We will also investigate the appropriateness of active network software in SMT
processors and examine real-time issues in signal processing applications.
This work will involve reexamining the design of embedded processor
microarchitectures.
Dedicated hardware contexts.
We want to examine whether it is possible to effectively reserve hardware
contexts for specific, commonly occurring functions that need to interact
with an executing main program.
Candidates include:
dynamic compilers that execute as co-routines with
the application they are dynamically specializing and then interrupt the
application to begin executing the code they have just generated;
profilers that gather
information that can be used to optimize programs as they execute,
e.g., to re-layout instructions and data or to specialize code to take
advantage of value locality;
garbage collectors;
and helper functions that compute different outputs for signal processing
applications to meet tighter real-time constraints.
There could be lots of examples here -- use your imagination!
Which functions should be chosen and why?
How many?
What performance benefit do they reap?
Are special hardware or software techniques necessary to support them?
Debugging on an SMT.
We expect that a significant challenge for SMT-based systems will be
debugging the heavily multithreaded applications and operating systems.
SMT's potential for extremely fine-grained synchronization will likely
exacerbate the existing difficulty of multithreaded programming and debugging.
The main issue is what the SMT debugging infrastructure should look like.
It may be possible to influence the design of the SMT scheduler to simplify
debugging.
One option would be to adopt the dedicated hardware context model mentioned
above and execute a debugging thread in a separate hardware context that
monitors application threads,
reporting on their progress or triggering breakpoints on certain
conditions.
OS functions in user space.
A trend in high-performance servers and I/O is towards event-driven (or
upcall-driven) architectures. SMT can potentially improve on this
trend, by dedicating an SMT thread to handle an interrupt; the
dedicated thread could push an event that results from a hardware
interrupt upwards through the operating system, and perhaps even
into user-level code, causing application-level processing to occur.
This design looks similar to active messages, but with a dedicated SMT thread
as the active context.
Because of the ability to simultaneously execute multiple SMT threads,
however, the limitations of current active message handlers
(little functionality because of the low-latency, non-blocking requirements)
could be relieved, vastly simplifying the task of using them.
The extension of this idea would be to design and build an upcall-oriented
OS, in which SMT threads emerge out of the operating system to dispatch
events to applications. With such an operating system, applications would be
structured purely as event handlers or upcall handlers, hopefully
resulting in much higher performance.
Hardware assists.
SMT produces better instruction throughput than other processors despite
not yet having some hardware assists that are standard in most
microarchitectures, such as a victim cache.
Some SMT-specific considerations may apply here.
For example, is the size of the victim cache a function of the number of threads,
the size of the L1 data cache, or something else?
How does SMT victim cache performance compare to that on a superscalar?
How does its optimal size compare to that needed by a superscalar?
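To make the mechanism under discussion concrete, here is a small sketch of a
direct-mapped L1 backed by a fully-associative victim cache. All parameters
(set count, victim-cache size, the ping-pong access pattern standing in for
two threads) are assumptions chosen for illustration, not a proposed SMT
configuration:

```python
# Illustrative sketch (assumed geometry): a direct-mapped L1 whose
# evicted lines fall into a small fully-associative victim cache, so a
# later conflict miss can be serviced without going to L2.
from collections import OrderedDict

L1_SETS = 64          # assumed: 64 direct-mapped sets (line addresses only)
VICTIM_ENTRIES = 4    # assumed: 4-entry victim cache

class VictimCachedL1:
    def __init__(self):
        self.l1 = {}                  # set index -> tag
        self.victims = OrderedDict()  # LRU of recently evicted (index, tag)
        self.misses = self.victim_hits = 0

    def access(self, addr):
        index, tag = addr % L1_SETS, addr // L1_SETS
        if self.l1.get(index) == tag:
            return "l1_hit"
        if (index, tag) in self.victims:   # conflict miss caught here
            self.victims.pop((index, tag))
            self.victim_hits += 1
            outcome = "victim_hit"
        else:
            self.misses += 1
            outcome = "miss"
        if index in self.l1:               # displaced line becomes a victim
            self.victims[(index, self.l1[index])] = True
            if len(self.victims) > VICTIM_ENTRIES:
                self.victims.popitem(last=False)   # drop LRU victim
        self.l1[index] = tag
        return outcome

cache = VictimCachedL1()
# Two threads ping-ponging on addresses that map to the same L1 set:
for _ in range(8):
    cache.access(0)        # "thread A"
    cache.access(L1_SETS)  # "thread B": same set, different tag
print(cache.misses, cache.victim_hits)  # -> 2 14
```

The sketch shows why interthread conflict misses are the interesting case
for sizing: every miss after the first two is absorbed by the victim cache.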
Cycle-time bottlenecks.
A controversial and discussion-provoking paper from UT Austin that appeared
at the 2000 International Symposium on Computer Architecture forecast a gloomy
future for centralized processor architectures, such as SMT.
The paper projected poor wire scaling (i.e., large wire delays)
as process technologies shrink, which would (theoretically) limit the
performance improvement of SMT relative to more decentralized architectures,
such as CMP.
With this in mind,
we think it's time to remeasure the time needed to access SMT's
hardware data structures,
such as the register file, the integer and floating point instruction queues,
all caches & TLBs, the activity list, the branch prediction tables and even
main memory, to ensure that they can be accessed in the number of cycles we
have currently allotted.
Fortunately, there is a publicly available tool to do the measurements
(ECacti).
But there are still some low-level parameters we need to nail down, such as
the appropriate clock scaling target, clock skew and pipeline latch time.
This would be a good project for someone with EE in their background or who
would like to learn about issues related to designs that are lower level
than the microarchitecture.
The goal for this project would be:
given the process technology forecast for about three years from now,
to design the most aggressive SMT architecture whose hardware
structures can be accessed within a single cycle
(except for the caches and memory of course).
The following three projects are related to this.
Clusters of functional units.
SMT is based on a wide-issue superscalar design (the main reason its
commercial transition should be easy), and most of its hardware resources
are shared by all threads (one of the reasons for its high instruction
throughput).
The flipside of these advantages is that they require long wire lengths
between various hardware resources, such as from the register files to
the functional units.
And wire delay contributes directly to cycle time.
To reduce cycle time we should investigate different ways to cluster
the functional units and other hardware so as to reduce wire length.
A good starting point would be the work on clustered architectures by
Farkas, Chow, Jouppi and Vranesic (MICRO-30, 1997).
Other critical paths.
In addition to, or instead of, a clustered SMT implementation,
we should identify the potential critical paths in an SMT processor
(possible candidates are register file access, register renaming,
hardware dependence checking and L1 cache access),
and devise ways of reducing them with either clever layout or scaling
back on the complexity of the design.
If the latter, we will want to ascertain the performance hit SMT takes
as a result.
A good starting point here is the critical path analysis and
microarchitecture designs that reduce
superscalar critical paths by Palacharla, Jouppi and Smith (ISCA 97).
In the initial stages of SMT research,
we investigated thread-private L1 caches,
but scrapped them in favor of shared caches because they performed worse.
Maybe now is the time to reconsider,
in light of the decreasing cycle times brought by denser process technologies.
Also see Renaming registers above.
Bypass logic.
Gad Sheaffer at Intel has suggested that because of its latency-hiding
ability,
an SMT does not need bypass logic.
If so, this would obviate concerns about the slowness of our bypassing,
which, because of the longer SMT pipeline, spans several levels.
We should check this out.
Better mechanisms for choosing the thread(s) to fetch.
Glenn Hinton at Intel has suggested that some of the thread fetching
policies we rejected in the ISCA 1996 paper might obtain better performance
if coupled with more aggressive hardware.
For example, we could revisit MISSCOUNT only for long latency misses
and BRCOUNT when coupled with branch prediction hardware that evaluates
the confidence of the prediction.
A place to begin the latter is with Jim Smith's branch confidence paper.
Hardware performance monitoring for better SMT performance.
Currently the instruction queue keeps track of the number of instructions
per thread it holds,
to enable the instruction fetcher to choose the best thread to fetch from
next (the one making the best progress through the queue, i.e., the one
with the fewest instructions in the queue).
What other feedback mechanisms would be useful?
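The heuristic described above (fetch from the thread with the fewest
instructions in the queue, i.e., the ICOUNT-style policy) reduces to a few
lines; the occupancy numbers below are invented for illustration:

```python
# Sketch of the fetch heuristic described above: each cycle, fetch
# from the thread with the fewest instructions already in the queue.

def choose_fetch_thread(queue_counts):
    """Return the id of the thread with the fewest queued instructions
    (ties broken in favor of the lowest thread id)."""
    return min(range(len(queue_counts)), key=lambda t: queue_counts[t])

occupancy = [12, 3, 7, 9]   # instructions queued per thread (assumed)
print(choose_fetch_thread(occupancy))  # -> 1
```

Other feedback mechanisms would slot in the same way: replace the occupancy
count with any per-thread progress metric the hardware can cheaply maintain.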
The guts of operating systems and simulators.
Much of this research will involve augmenting the infrastructure we have
built to experiment with and measure real operating systems in an SMT
environment.
This infrastructure is based on the SimOS simulation system.
SimOS is an instruction set simulator that is detailed enough to boot and
execute an entire operating system, i.e., it simulates all privileged state,
including I/O, interrupts, and exception handling.
For the Compaq Alpha, SimOS simulates Tru64 UNIX,
a high-calibre 64-bit operating system,
which is optimized for many modern tasks, including network handling.
Compaq SimOS also simulates Alpha PALcode,
a type of "microcode" in which the
lowest level operating system support services,
such as TLB miss handling, execute.
Exposure to this software environment will offer valuable experience in
understanding large-scale simulators and modern operating systems.
This page maintained by Susan Eggers
eggers@cs.washington.edu