Fall 1992 Technical Report Abstracts
From 92-06-01 To 92-12-08
Computer Science and Engineering
University of Washington

TR # 92-06-01
``The Kheystone Benchmark for Parallel Performance Prediction''
Calvin Lin and Lawrence Snyder

The Kheystone is a simple parameterized program that is used to characterize the execution of other, more complex parallel programs. By using information from Kheystone running on one machine, along with two values describing the computation and communication times for a new machine, the program's performance on the new machine can be predicted. This methodology and the Kheystone benchmark are introduced by using the Modified Gram-Schmidt method for QR factorization as a sample application. The characterizations of the sample program are given for two source machines, the Sequent Symmetry and the Intel iPSC/2, and these are used to predict the behavior of the original QR program on five machines: the Symmetry, the nCUBE/7, two iPSC/2's, and the BBN Butterfly. These initial results indicate that the method can be surprisingly effective.

TR # 92-06-02
``Weyl: A Language for Computer Graphics and Computer Aided Geometric Design''
Austin Dahl

The concepts and notation of affine geometry are useful for research in computer graphics and computer aided geometric design. However, the notation is lost when algorithms are moved to an implementation language. In this thesis we present a language whose syntax matches the notation of computer graphics and computer aided geometric design more closely than traditional languages. The language also provides built-in types for affine geometry, supports curves and surfaces with polynomial and blossom data types, is interactive, and provides built-in graphics primitives, which are syntactically lightweight. The product is a powerful, elegant language that allows the user to type geometric expressions and instantly see the results displayed graphically.

TR # 92-06-03
``Reducing Memory Latency via Non-Blocking and Prefetching Caches''
Tien-Fu Chen and Jean-Loup Baer

Non-blocking caches and prefetching caches are two techniques for hiding memory latency by exploiting the overlap of processor computations with data accesses. A non-blocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed, thus exploiting post-miss operations. A prefetching cache generates prefetch requests to bring data into the cache before it is actually needed, thus allowing overlap with pre-miss computations. In this paper, we evaluate the effectiveness of these two hardware-based schemes. We propose a hybrid design based on the combination of these approaches. We also consider compiler-based optimizations to enhance the effectiveness of non-blocking caches. Results from instruction-level simulations on the SPEC benchmarks show that the hardware prefetching caches generally outperform non-blocking caches. Also, the relative effectiveness of non-blocking caches is more adversely affected by an increase in memory latency than that of prefetching caches. However, the performance of non-blocking caches can be improved substantially by compiler optimizations such as instruction scheduling and register renaming. The hybrid design can be very effective in reducing the memory latency penalty for many applications.

TR # 92-06-04
``The Virtual System Model: A Scalable Approach to Organizing Large Systems''
Clifford Neuman

Naming is critical in distributed systems. Names identify files, services, processors, and users.
A name service translates the names of objects to the information needed to access those objects. The growth of distributed systems brings with it an increase in the number of objects to be named. Existing naming techniques are derived from techniques used on centralized systems and are not sufficient for organizing the large number of objects that are becoming available.

This dissertation presents the Virtual System Model, a new model for naming that helps users organize the information available in large systems. The Virtual System Model has four principal features: support for customizable name spaces, tools to help users construct name spaces, support for synonyms, and a method for selecting the appropriate name space when resolving names. In the Virtual System Model users create customized views of the system, called virtual systems, using tools that allow new views to be specified as functions of other, possibly changing, views. In a customized view, objects that users frequently use have short names while those that are never used are hidden from view. A problem that arises from customized naming is the lack of name transparency: the same name may refer to different objects when used in different name spaces. This problem is addressed by supporting closure: each object has an associated name space, and that name space is used to resolve names that are specified by the object.

The dissertation begins by discussing the naming mechanisms used in existing systems and by highlighting the problems that arise as systems grow. The Virtual System Model is described, and its differences from existing models of naming are discussed. The application of the Virtual System Model to distributed systems is described, including its use in organizing and searching for information, selecting processors and services, and specifying security requirements. A prototype file system named Prospero demonstrates the usefulness of the model. The prototype is used on more than 7,500 systems in 29 countries to organize and explore information available from Internet archive sites worldwide. The design, implementation, and use of the prototype are described.

TR # 92-07-01
``Multi-Garnet: Integrating Multi-Way Constraints with Garnet''
Michael Sannella and Alan Borning

Constraints provide a useful mechanism for maintaining relations in user interface toolkits. Garnet is a widely-used user interface toolkit with considerable functionality, based on one-way, required constraints. Multi-Garnet extends Garnet by adding support for multi-way constraints and constraint hierarchies with both required and preferential constraints. This document contains three chapters describing Multi-Garnet:

Chapter 1 presents a high-level overview of Multi-Garnet. To motivate the development of Multi-Garnet, we examine the Garnet constraint system, present some realistic user interface problems that are difficult to handle in Garnet, and demonstrate how Multi-Garnet addresses these problems. We provide details on how Multi-Garnet supports some of the features of Garnet, including constraints with pointer variables and inheritance of constraints.

Chapter 2 contains a reference manual for the current version of Multi-Garnet (version 2.1). This includes information on compiling and loading Multi-Garnet, as well as documentation for the functions and macros used to create and manipulate Multi-Garnet constraints. This chapter also contains additional details on the implementation of Multi-Garnet.

Chapter 3 describes a large Multi-Garnet example: a scatterplot displaying a set of points. Multi-Garnet constraints are used to maintain relationships between the data values, the screen positions of the points, and the positions of the X- and Y-axes. Multiple interaction modes allow manipulating the scatterplot points and axes in different ways.
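
For readers unfamiliar with the distinction, a multi-way constraint can be satisfied by recomputing any one of its variables, whereas a one-way constraint always recomputes the same output. The sketch below (plain Python with invented class and variable names; not Multi-Garnet or DeltaBlue code) maintains a Celsius/Fahrenheit relation by running whichever method leaves the most recently set variable alone.

    # Illustrative sketch only -- not Multi-Garnet or DeltaBlue code.
    # A multi-way constraint stores one method per output variable and
    # re-satisfies itself by running a method that does not overwrite
    # the variable the user just set.

    class MultiWayConstraint:
        def __init__(self, variables, methods):
            self.variables = variables        # dict: name -> value
            self.methods = methods            # dict: output name -> function(vars)

        def set(self, name, value):
            """Change one variable, then re-enforce the relation by
            recomputing some *other* variable."""
            self.variables[name] = value
            for output, method in self.methods.items():
                if output != name:            # pick a method that leaves 'name' alone
                    self.variables[output] = method(self.variables)
                    return

    # A temperature relation: f = c * 9/5 + 32, usable in either direction.
    temp = MultiWayConstraint(
        {"c": 0.0, "f": 32.0},
        {"f": lambda v: v["c"] * 9.0 / 5.0 + 32.0,
         "c": lambda v: (v["f"] - 32.0) * 5.0 / 9.0})

    temp.set("c", 100.0)     # drives f to 212.0
    temp.set("f", 32.0)      # drives c back to 0.0
    print(temp.variables)    # -> {'c': 0.0, 'f': 32.0}
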
TR # 92-07-04
``An Efficient and Highly Available Read-One Write-All Protocol for Replicated Data''
Ed Lazowska and Michael Rabinovich

A new read-one write-all (ROWA) protocol for replicated data is proposed that allows a system to adjust to failures dynamically in order to keep the data available. If failures arrive mostly sequentially, our protocol keeps the data available as long as there is at least one operational replica. This is achieved by making the epoch mechanism, previously applicable to non-ROWA schemes only, usable within the ROWA discipline. Also, adjusting the system to a new configuration in our protocol is done completely asynchronously with reads and writes. In contrast, in the existing dynamic schemes (both ROWA and non-ROWA), system reconfiguration may interfere with and delay user transactions.

TR # 92-07-05
``Multi-way versus One-way Constraints in User Interfaces: Experience with the DeltaBlue Algorithm''
Michael Sannella, Bjorn Freeman-Benson, John Maloney, and Alan Borning

The efficient satisfaction of constraints is essential to the performance of constraint-based user interfaces. In the past, most constraint-based user interfaces have used one-way rather than multi-way constraints because of a widespread belief that one-way constraints were more efficient. In this paper we argue that many user interface construction problems are handled more naturally and elegantly by multi-way constraints than by one-way constraints. We present pseudocode for an incremental multi-way constraint satisfaction algorithm, DeltaBlue, and describe experience in using the algorithm in two user interface toolkits. Finally, we provide performance figures demonstrating that multi-way constraint solvers can be entirely competitive in performance with one-way constraint solvers.

TR # 92-07-07
``Non-Uniformities Introduced By Virtual Channel Deadlock''
Kevin Bolding

A common scheme for preventing deadlock in networks is the virtual channel method of Dally and Seitz. Due to the nature of this scheme, an otherwise completely uniform network will have non-uniformities introduced into it. The variations introduce several effects, ranging from limitations on overall network performance to differences in observed network characteristics from node to node and from message to message.

TR # 92-07-09
``Beyond the Naivety of Grades: Educational Record Keeping for the Twenty-First Century''
Steven L. Tanimoto

At its core, education is an information-based activity: teaching is usually thought of as the presentation of information, and learning as the acquisition of information. Yet modern information-management methodologies have been sorely lacking in US public education. For example, the primary record of student progress is the transcript of grades, a practice from the late nineteenth century that is now thoroughly entrenched as a major tenet of the academic bureaucracy.
Today, the needs for recorded information about students' progress and learning styles are much greater, not only because of today's diversity of student backgrounds, but also because the goals of learning themselves are more diverse, and the educational environment is more decentralized and uncoordinated than ever before. Accurate and detailed information is needed not only about each student's mastery of concepts, vocabulary, problem-solving, and other knowledge but also about instructional approaches known to be successful with the student and about extracurricular (e.g., family) problems the student is confronting; in short, information about anything that ultimately affects the learning process. Management is impossible without information. While it may be possible to make a few incremental improvements to education by "throwing technology at the classroom," major improvements will not happen until the current approach to education is reorganized. Key features of a workable solution are the elimination of grades as we know them, the integration of detailed monitoring and management of each student's progress, and restructuring the record-keeping system in order to instill in each student a new attitude of enthusiastic acceptance of responsibility for his or her own education.

TR # 92-08-01
``Papers Presented at the Microelectronic System Education Conference 1989-91''
Gaetano Borriello, Carl Ebeling, Larry McMurchie, Neil McKenzie

The Microelectronic System Education Conference and Exhibition has been held each summer since 1988 to bring together educators dedicated to furthering undergraduate and graduate education in designing and building innovative microelectronic systems. The conference emphasizes the impact of the latest developments in research (CAD, software methodologies, tools and technologies for system building) on university education. This report collects the papers we have presented at this conference over the three years from 1989 to 1991. They address aspects of both our undergraduate and graduate curricula in digital system design.

TR # 92-08-02
``Predicting Deterministic Execution Times of Real-Time Programs''
C. Y. Park

The most distinctive characteristic of a real-time system is that it requires temporal correctness. Therefore, analyzing the timing behavior of a real-time system is essential. In this dissertation, we address this problem by predicting the deterministic execution times of the programs comprising a system. Our approach is to predict a time interval, covering all execution cases, with an analytic method at the source program level. The basic prediction method is based on a formal timing schema for a source-level construct. A simple and efficient timing tool for C programs is implemented in a practical setting. Comparison with measurements shows that predictions are safe and usually tight. However, the tool produces loose predictions for some complex programs because of infeasible paths.

For tighter predictions, we analyze the dynamic behavior of a program using information provided by a user. We introduce a formal path model where both a program and user execution information are represented by a set of program paths described by an extended regular expression. Infeasible paths are eliminated by intersecting the two path sets. With a well-defined practical high-level interface language, the path model can be used in an easy and efficient way. We also introduce a method to verify given user information with known program verification techniques. Extended with path analysis, a timing tool yields safe and accurate predictions for a wide range of programs. It also supports incremental refinement where overall prediction cost is scalable with respect to desired precision.

As a start for deterministic timing analysis of input/output operations, we introduce a method to predict the execution time of an I/O statement. The main problems in I/O analysis are predicting waiting time for a device or data and analyzing interference caused by asynchronous execution, in a variety of implementation policies. Our approach is to develop a general framework and to apply it to a specific case in a systematic way. We provide a schema refinement rule for each implementation policy that transforms a timing schema to be more specific and thus more predictable in a target system. We also introduce formulas that compute the effect of interference in a target system.

We developed methods to predict deterministic program execution times, and validated them by experiments. Predictions are safe, accurate and feasible with reasonable cost. Future work includes experimenting with large programs, analyzing concurrent programs, and reasoning about the timing properties of a system with predictions.
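
The schema-based prediction style can be illustrated with a small sketch (the miniature AST encoding and the constant costs below are invented for illustration; this is not the tool described above): each source construct yields a best/worst execution-time interval, composed recursively.

    # Minimal illustration of a timing schema: each source construct gets an
    # interval [best, worst], composed recursively.  Costs and AST are invented.

    def bound(node):
        kind = node[0]
        if kind == "stmt":                       # ("stmt", best, worst)
            return (node[1], node[2])
        if kind == "seq":                        # ("seq", n1, n2)
            b1, w1 = bound(node[1])
            b2, w2 = bound(node[2])
            return (b1 + b2, w1 + w2)
        if kind == "if":                         # ("if", cond, then, else)
            bc, wc = bound(node[1])
            bt, wt = bound(node[2])
            be, we = bound(node[3])
            return (bc + min(bt, be), wc + max(wt, we))
        if kind == "loop":                       # ("loop", cond, body, max_iters)
            bc, wc = bound(node[1])
            bb, wb = bound(node[2])
            n = node[3]
            return (bc, (wc + wb) * n + wc)      # 0..n iterations, plus final test
        raise ValueError(kind)

    prog = ("seq",
            ("loop", ("stmt", 1, 1), ("stmt", 4, 9), 10),
            ("if", ("stmt", 1, 2), ("stmt", 5, 5), ("stmt", 3, 8)))
    print(bound(prog))        # -> (5, 111): a best/worst execution-time interval
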
TR # 92-08-03
``Adding Scheduler Activations to Mach 3.0''
Paul Barton-Davis, Dylan McNamee, Raj Vaswani, and Edward D. Lazowska

When user-level threads are built on top of traditional kernel threads, they can exhibit poor performance or even incorrect behavior in the face of blocking kernel operations such as I/O, page faults, and processor preemption. This problem can be solved by building user-level threads on top of a new kernel entity, the scheduler activation. The goal of the effort described in this paper was to implement scheduler activations in the Mach 3.0 operating system. This is an "implementation" paper rather than a "concepts" paper: we outline the design decisions made, the kernel modifications required, and our additions to the CThreads thread library to take advantage of the new kernel structure.

TR # 92-09-01
``Double Exponential Inseparability of Robinson Subsystem Q_T From the Logically False Formulas in the Language of Addition''
Giovanni Faglia and Paul Young

In this work we prove a double exponential time inseparability result for a finitely axiomatizable theory that is weaker than ADDAX. Every set that separates Q_T from the logically false sentences of addition is not recognizable by any Turing machine working in double exponential time. This also implies that any theory of addition that is consistent with Q_T, in particular any theory contained in Pr, is at least double exponential time difficult.

TR # 92-09-03
``Monitoring Timing Constraints in Distributed Real-time Systems''
Sitaram C. V. Raju, Ragunathan Rajkumar, Farnam Jahanian

In distributed real-time systems, the unpredictability of interactions with the external environment and violations of design assumptions at run-time can lead to unexpected conditions such as system overload and the missing of deadlines. It is highly desirable that these error conditions do not lead to total system failure and that the critical functions of the system are still performed. In this paper, we describe a run-time facility for monitoring distributed real-time systems. In particular, we focus on the problem of detecting violations of timing assertions in a distributed environment in which the real-time tasks run on multiple processors, and timing constraints can be inter-processor or intra-processor constraints. Constraint violations are detected at the earliest possible time by deriving and checking intermediate constraints from the user-specified constraints. If the violations must be detected as early as possible, the problem of minimizing the number of messages to be exchanged between the processors becomes intractable. We then characterize a sub-class of timing constraints which occur commonly in distributed real-time systems and whose message requirements are minimized easily. We also take into account the drift among the various processor clocks when detecting a violation of a timing assertion. We finally describe a prototype implementation of our distributed run-time monitor on a network of IBM RS/6000 workstations.
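
A trace-based check of a simple deadline assertion can be sketched as follows (invented event names and data; the report's assertion notation is based on Real-Time Logic and handles far more general constraints than this):

    # Illustrative deadline check over a time-stamped event trace (invented data).
    # Constraint: the i-th "reply" must occur within 50 time units of the
    # i-th "request", allowing a bound eps on inter-processor clock drift.

    def check_deadline(trace, start, end, deadline, eps=0.0):
        starts = [t for (e, t) in trace if e == start]
        ends   = [t for (e, t) in trace if e == end]
        violations = []
        for i, (ts, te) in enumerate(zip(starts, ends)):
            if te - ts > deadline + eps:      # tolerate up to eps of clock drift
                violations.append((i, ts, te))
        return violations

    trace = [("request", 0), ("reply", 30), ("request", 100), ("reply", 180)]
    print(check_deadline(trace, "request", "reply", deadline=50, eps=2))
    # -> [(1, 100, 180)]: the second request/reply pair misses its deadline
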
TR # 92-09-04
``The Systematic Plan Adaptor: A Formal Foundation for Case-Based Planning''
Steven Hanks and Daniel S. Weld

Research efforts in case-based reasoning have made advances in various problem domains, but general principles and domain-independent algorithms have been slower to emerge. We explore the theoretical foundations of case-based planning, in particular characterizing the fundamental tradeoffs that govern the process of plan adaptation. To do so we view the planning process as search through a graph of partial plans: plan generation starts at the graph's root and adds constraints, while plan adaptation starts at an arbitrary place in the graph and can either add or delete constraints. This paper develops a domain-independent algorithm for case-based plan adaptation, SPA, that is sound, complete, and systematic. Since our algorithm is sound, every plan it returns is guaranteed to achieve the goal. Since our algorithm is complete, it is guaranteed to find a plan if one exists, regardless of the initial plan returned from the library. Systematicity guarantees that our algorithm explores the space of partially specified, incomplete plans in a manner which avoids redundant or wasted effort. This paper describes the algorithm and its implementation in some detail, shows how our framework sheds light on existing heuristic, transformational planners, and presents experimental results that demonstrate a systematic relationship between library reuse and computational savings in a simple domain.

TR # 92-09-06
``Implementations and Variations of a Maximum-Flow Algorithm''
Joao C. Setubal

We study the push-relabel algorithm for the maximum flow problem and present a number of novel results chiefly concerning its performance in practice. We show that an implementation of the push-relabel algorithm using the global relabeling heuristic is by far the fastest for the sparse graphs tested, when compared to four other algorithm implementations. These results show that the push-relabel algorithm is even better in practice than previous studies suggested. We present a new modification of the push-relabel algorithm that allows its efficient implementation in parallel. The modification allows global relabeling to be executed concurrently with push and relabel operations. This technique, combined with special data structures that enable a limited ability for parallel granularity control, results in a parallel program that achieves good speed-ups over the sequential implementation on a shared-memory multiprocessor. The main code for this implementation is presented in an appendix.

We extend both sequential and parallel results to the bipartite matching problem. We show that a specialized implementation of the push-relabel algorithm for this problem is competitive with or faster than implementations of two other algorithms that have better worst-case time bounds. A parallel implementation of the push-relabel algorithm for the bipartite matching problem based on the techniques mentioned above was done and achieved modest speed-ups over the sequential implementation. Although the speed-ups are small, they are significant since the existing parallel algorithms for bipartite matching seem to be quite impractical. Finally, we present a new variation of the push-relabel algorithm that allows positive or negative excess flow in active vertices. Although the running time bounds are the same as in the basic push-relabel algorithm, this variation is more versatile, being particularly useful for solving general instances of the dynamic maximum flow problem in practice.
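
For reference, a compact textbook version of the basic push-relabel algorithm is sketched below (FIFO order, no global-relabeling heuristic, although the report finds that heuristic essential for speed in practice; the example graph is invented):

    # Textbook FIFO push-relabel maximum flow (no global-relabeling heuristic).
    # The graph is a capacity matrix; s = source, t = sink.

    from collections import deque

    def max_flow(cap, s, t):
        n = len(cap)
        flow = [[0] * n for _ in range(n)]
        height = [0] * n
        excess = [0] * n
        height[s] = n
        for v in range(n):                    # saturate all edges leaving the source
            flow[s][v] = cap[s][v]
            flow[v][s] = -cap[s][v]
            excess[v] = cap[s][v]
        active = deque(v for v in range(n) if v not in (s, t) and excess[v] > 0)
        while active:
            u = active[0]
            pushed = False
            for v in range(n):
                residual = cap[u][v] - flow[u][v]
                if residual > 0 and height[u] == height[v] + 1:
                    delta = min(excess[u], residual)      # push
                    flow[u][v] += delta
                    flow[v][u] -= delta
                    excess[u] -= delta
                    excess[v] += delta
                    if v not in (s, t) and v not in active:
                        active.append(v)
                    pushed = True
                    if excess[u] == 0:
                        break
            if excess[u] == 0:
                active.popleft()
            elif not pushed:
                # Relabel: lift u just above its lowest residual neighbor.
                height[u] = 1 + min(height[v] for v in range(n)
                                    if cap[u][v] - flow[u][v] > 0)
        return sum(flow[s][v] for v in range(n))

    cap = [[0, 10, 10, 0],
           [0, 0, 5, 10],
           [0, 0, 0, 10],
           [0, 0, 0, 0]]
    print(max_flow(cap, 0, 3))   # expected maximum flow: 20
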
TR # 92-09-07
``Improving the Performance of Message-Passing Applications by Multithreading''
Edward W. Felten and Dylan McNamee

Achieving maximum performance in message-passing programs requires that calculation and communication be overlapped. However, the program transformations required to achieve this overlap are error-prone and add significant complexity to the application program. We argue that calculation/communication overlap can be achieved easily and consistently by executing multiple threads of control on each processor, and that this approach is practical on message-passing architectures without any special hardware support. We present timing data for a typical message-passing application to demonstrate the advantage of our scheme.

TR # 92-09-08
``Techniques for File System Simulation''
Chandramohan A. Thekkath, John Wilkes, and Edward D. Lazowska

The design of file and disk systems is an extremely active area. This paper describes a collection of techniques for performing careful simulation-based evaluations of such systems. These techniques have novel aspects in the following areas: workload characterization, file system modeling, and disk behavior representation. They make feasible the detailed simulation of new I/O hardware and file system software, and of extensions to existing designs. In particular, using the techniques described here is likely to make comparative file system studies more accurate. In addition to these specific contributions, the paper makes two broader points. First, it argues that detailed simulations are appropriate and necessary in many cases. Second, it demonstrates that detailed models need not be difficult or time-consuming to construct.

TR # 92-10-01
``Generalized B-spline Surfaces of Arbitrary Topological Type''
Charles Teorell Loop

B-spline surfaces, although widely used, are incapable of describing surfaces of arbitrary topological type. It is not possible to model a general closed surface or a surface with handles as a single non-degenerate B-spline. In practice such surfaces are often needed. In this thesis, a generalization of bicubic tensor product and quartic triangular B-spline surfaces is presented that is capable of capturing surfaces of arbitrary topological type. These results are obtained by relaxing the sufficient but not necessary smoothness constraints imposed by B-splines and through the use of an n-sided generalization of Bezier surfaces called S-patches.
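
As background for the tensor-product constructions mentioned above, the following sketch (not code from the thesis) evaluates an ordinary bicubic Bezier patch by de Casteljau interpolation over the rows and then over the resulting column:

    # Background sketch (not code from the thesis): evaluating a bicubic
    # tensor-product Bezier patch by repeated de Casteljau interpolation.

    def de_casteljau(points, t):
        """Evaluate a Bezier curve with the given control points at parameter t."""
        pts = [p[:] for p in points]
        while len(pts) > 1:
            pts = [[(1 - t) * a + t * b for a, b in zip(p, q)]
                   for p, q in zip(pts, pts[1:])]
        return pts[0]

    def bezier_patch(net, u, v):
        """net is a 4x4 grid of 3-D control points; returns the surface point."""
        column = [de_casteljau(row, u) for row in net]
        return de_casteljau(column, v)

    # A flat 4x4 control net (z = 0), for a quick check.
    net = [[[float(i), float(j), 0.0] for i in range(4)] for j in range(4)]
    print(bezier_patch(net, 0.5, 0.5))   # -> [1.5, 1.5, 0.0], the centre of the net
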
TR # 92-10-02
``Time-Space Tradeoffs for Graph s-t Connectivity''
Gregory Barnes

The problem of graph s-t connectivity, determining whether two vertices s and t in a graph are in the same connected component, is a fundamental problem in computational complexity theory. Determining the space complexity of s-t connectivity either for directed graphs (stcon) or for undirected graphs (ustcon) would tell us a great deal about the relationships among deterministic, nondeterministic, and probabilistic logarithmic space bounded complexity classes. A fruitful intermediate step to determining the space complexity of stcon and ustcon is to explore time-space tradeoffs for the problems: the simultaneous time and space requirements of algorithms for connectivity. Prior to this work, all deterministic connectivity algorithms that used less than linear space (the space bound for well-known algorithms such as breadth-first and depth-first search) required superpolynomial time (e.g. Savitch's algorithm, which uses O(log^2 n) space and n^(O(log n)) time). In this thesis, we explore various techniques for creating small-space, but polynomial-time algorithms for undirected and directed graph connectivity. We present sublinear-space, polynomial-time deterministic algorithms for both problems. No previous such algorithms were known for these problems; in fact, there was some evidence that such algorithms could not exist for directed graphs. We present a subproblem, the short paths problem, that could separate stcon and ustcon: we present evidence that finding the length of a simple path is intrinsic to solving stcon, but not ustcon, in small space.

TR # 92-10-03
``A Prototyping Environment for Specifying, Executing and Checking Communicating Real-time State Machines''
Sitaram C. V. Raju and Alan C. Shaw

We describe a toolset, consisting of a graphical editor, a simulator, and an assertion checker, for prototyping distributed real-time systems that are specified as Communicating Real-time State Machines (CRSMs). CRSMs are timed state machines that communicate synchronously over uni-directional channels. The system behavior of CRSMs is characterized by a time-stamped trace of communication events. Safety and timing assertions on the trace of communication events are expressed in a notation based on Real-Time Logic. We illustrate the novel aspects of the simulator and assertion checker by specifying a traffic-light controller and other real-time systems.

TR # 92-10-04
``A Multi-Level Hierarchical Cache Coherence Protocol for Multiprocessors''
Craig Anderson and Jean-Loup Baer

In order to meet the computational needs of the next decade, shared-memory multiprocessors must be scalable. Though single shared-bus architectures have been successful in the past, limited bus bandwidth restricts the number of processors that can be effectively put on a single-bus machine. One architecture that has been proposed to solve the limited bandwidth problem connects processors using a tree hierarchy of buses. In this paper we present a tool to study a hierarchical bus-based shared-memory system. We show that there is a need for transitional states in the cache coherence protocol that are not necessary in a single-bus architecture. We also give several examples of the protocol in action. Finally, we conclude with some preliminary results, and some examples of how the protocol and architecture could be made more efficient.
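
Why transitional states arise can be illustrated with a toy line-state machine (a generic sketch, not the protocol in the report): after a cache asks for exclusive ownership, the line waits in a transient state until the reply filters back down the hierarchy, and must still react to bus traffic observed in the meantime.

    # Toy illustration of transitional states (not the report's protocol): a line
    # that has requested exclusive ownership waits in a transient state until the
    # reply propagates back down the bus hierarchy.

    INVALID, SHARED, MODIFIED, SHARED_TO_MOD = "I", "S", "M", "S->M (waiting)"

    def next_state(state, event):
        table = {
            (INVALID,       "proc_read"):      SHARED,
            (SHARED,        "proc_write"):     SHARED_TO_MOD,  # request sent upward
            (SHARED_TO_MOD, "ack"):            MODIFIED,       # reply came back
            (SHARED_TO_MOD, "bus_invalidate"): INVALID,        # lost the race meanwhile
            (SHARED,        "bus_invalidate"): INVALID,
            (MODIFIED,      "bus_read"):       SHARED,         # supply data, downgrade
        }
        return table.get((state, event), state)

    state = INVALID
    for event in ["proc_read", "proc_write", "ack"]:
        state = next_state(state, event)
        print(event, "->", state)
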
TR # 92-10-05
``Improving the Performance of Runtime Parallelization''
Shun-tak Leung and John Zahorjan

When the inter-iteration dependency pattern of the iterations of a loop cannot be determined statically, compile-time parallelization of the loop is not possible. In these cases, runtime parallelization [8] is the only alternative. The idea is to transform the loop into two code fragments: the inspector and the executor. When the program is run, the inspector examines the iteration dependencies and constructs a parallel schedule. The executor subsequently uses that schedule to carry out the actual computation in parallel. In this paper, we show how to reduce the overhead of running the inspector through its parallel execution. We describe two related approaches. The first, which emphasizes inspector efficiency, achieves nearly linear speedup relative to a sequential execution of the inspector, but produces a schedule that may be less efficient for the executor. The second technique, which emphasizes executor efficiency, does not in general achieve linear speedup of the inspector, but is guaranteed to produce the best achievable schedule. We present these techniques, show that they are correct, and compare their performance to existing techniques using a set of experiments. Because in this paper we optimize inspector time while leaving the executor unchanged, the techniques we present have the most dramatic effect when the inspector must be run for each invocation of the source loop. In a companion paper, we explore techniques that build upon those developed here to also improve executor performance.
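
The inspector/executor split can be sketched as follows (a sequential toy version with an invented dependence table; the paper's contribution is running the inspector itself in parallel): the inspector assigns each iteration to a wavefront, and the executor runs the wavefronts in order, with the iterations inside a wavefront free to run concurrently.

    # Toy inspector/executor (sequential sketch).  deps[i] lists the earlier
    # iterations that iteration i depends on, e.g. through array subscripts
    # known only at run time.

    def inspector(deps):
        """Assign each iteration to a wavefront: 1 + the latest wavefront
        among the iterations it depends on."""
        wave = {}
        for i in sorted(deps):                       # iterations in program order
            wave[i] = 1 + max((wave[j] for j in deps[i]), default=0)
        schedule = {}
        for i, w in wave.items():
            schedule.setdefault(w, []).append(i)
        return [schedule[w] for w in sorted(schedule)]

    def executor(schedule, body):
        for wavefront in schedule:                   # wavefronts run in sequence
            for i in wavefront:                      # these could run in parallel
                body(i)

    deps = {0: [], 1: [0], 2: [], 3: [1, 2], 4: [2]}
    print(inspector(deps))    # -> [[0, 2], [1, 4], [3]]
    executor(inspector(deps), lambda i: None)
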
TR # 92-10-06
``Counting Protocols for Reliable End-to-End Transmission''
Richard Ladner, Anthony LaMarca, Ewan Tempero

We introduce a new class of protocols, called counting protocols. These protocols provide a reliable virtual circuit for computer networks in which packets may be delivered out of order. Standard protocols use bounded sequence numbers to identify lost packets. In networks that do not remove very old packets, these protocols may experience wrap-around and fail. Counting protocols use a different approach. Three counting protocols are presented: (i) the one-bit protocol, which uses a one-bit header and sends one packet per message under ideal conditions, but performs poorly in networks with realistic loss rates, (ii) the mode protocol, which uses multiple-bit headers and behaves well in networks with realistic loss rates, and (iii) the mode-power protocol, which also uses multiple-bit headers and performs better than the mode protocol on short sequences, but worse on very long sequences. We give the derivations of the three counting protocols and prove the one-bit protocol correct. We do a careful performance analysis of both the one-bit and mode protocols and present the results of a simulation study to demonstrate the performance of the mode-power protocol.

TR # 92-10-07
``Practical Issues in Retiming Latch-Based Circuits''
Brian Lockyear and Carl Ebeling

In this paper we examine some of the practical issues that arise when using retiming to optimize real systems. We first propose that retiming should be an integral part of a practical system design methodology and not just a low-level optimization technique. We then remove some of the practical limitations of retiming by showing how to handle clock skew and latch setup time and propagation delays. Finally, we describe how to use retiming to generate circuits that are as robust as possible to variations in delay and clock skew parameters.

TR # 92-10-08
``The UW MacTester: A Low-Cost Functional Tester for Interactive Testing and Debugging''
Neil McKenzie, Larry McMurchie and Carl Ebeling

We describe a low-cost functional tester that provides a large number of programmable I/O signals for interactively testing and debugging chips, boards and subsystems. Testing a subsystem requires only minimal hardware setup, which allows the tester to be shared by several users and reduces the overall cost of test equipment. Testing is performed under program control which can be driven either by a set of test vectors or by a user-supplied procedural test program. The tester system consists of the tester unit, a parallel interface card and a personal computer as the host. The tester architecture uses Xilinx FPGAs for the pin electronics and static RAM to enable testing of dynamic circuits. A simple but powerful software interface to the tester makes possible a wide variety of testing and debugging environments.

TR # 92-11-01
``Timing Analysis of Concurrent Systems: An Algorithm for Determining the Time Separation of Events''
Tod Amon, Henrik Hulgaard, Gaetano Borriello, Steve Burns

A fundamental problem in the synthesis and optimization of concurrent systems is the determination of the separation time between system events. We present a theoretical framework for solving this problem for arbitrary process graphs without conditional behavior and develop an efficient and exact algorithm based on this theoretical foundation. Examples are used to demonstrate the operation and generality of the algorithm.

TR # 92-11-04
``Message Logging and Data-Parallel Languages''
Alexander C. Klaiber and Henry M. Levy

As modern multiprocessor systems increase in scale, the probability of faults increases to the point where the MTBF is lower than the running time of typical applications. It is therefore imperative that such systems provide a means of restarting computations after system failures. So far, this has usually been the responsibility of the application programmer. In this paper, we review several proposed general-purpose restart mechanisms for multiprocessor systems. In particular, we take a closer look at the class of optimistic message logging schemes and study the I/O overhead incurred by one representative implementation. Our benchmarks demonstrate that this overhead becomes extremely large for applications with high message traffic. Further, much of the logged information is redundant. We then propose an alternative recovery scheme that relies on the fact that parallel programs written in a data-parallel language can be compiled in such a way as to be independent of message arrival order. Our scheme achieves the benefits of message logging, such as fast output commit, at significantly lower overhead.

TR # 92-11-05 (also UCI TR #92-108)
``An Investigation of the Therac-25 Accidents''
Nancy G. Leveson and Clark S. Turner

Risk in any complex technology is unavoidable. However, important lessons can be learned from accidents, and these lessons can be used to design procedures for reducing risk in the future. Although descriptions of the Therac-25 medical electron accelerator accidents have been published previously, they are incomplete and often misleading.
This paper contains a detailed account of these accidents, along with some lessons that can be learned from them in terms of system engineering, software engineering, and government regulation of safety-critical systems involving software.

TR # 92-11-06
``Surface Approximation Using Geometric Hermite Patches''
Stephen Mann

Many surfaces have mathematically complex or computationally expensive representations. For some applications, an approximation to these surfaces is adequate. Numerous tangent plane smooth surface interpolants are reviewed, and inadequacies of these schemes for the approximation of surfaces are investigated. Two high-order-of-approximation surface patches are presented and used to construct continuous, approximating surfaces. The use of these patches is demonstrated for approximating offset surfaces, algebraic surfaces, and S-patches.

TR # 92-11-07
``A Self-Timed Counter II''
Ted Kehl

This report is a new version of a previous technical report which was never issued. Specifically, optimization of the original design resulted in about a 40% speed-up.

TR # 92-11-08
``A Self-Timed Comparand Register''
Ted Kehl

A comparand register is simply a register with additional circuitry arranged such that the contents of the register are compared, for equality, with an input value. This comparand register has two additional features: it is self-timed and has an inverter chain for high-resolution timing.

TR # 92-12-01
``Bounds on Sample Space Size for Matrix Product Verification''
Donald D. Chinn, Rakesh K. Sinha

We show that the size of any sample space that could be used in Freivalds' probabilistic matrix product verification algorithm for n x n matrices is at least (n-1)/epsilon if the error probability is at most epsilon. We also provide a characterization of any sample space for which Freivalds' algorithm has error probability at most epsilon. We then provide a generalization of Freivalds' algorithm and provide matching lower and upper bounds on the error probability in terms of the sample space size and running time.
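
For reference, Freivalds' test itself is easily sketched (the standard algorithm with fully random 0/1 vectors, rather than the restricted sample spaces studied in the report): to check whether AB = C, compare A(Br) with Cr for a random vector r.

    # Freivalds' matrix product verification: to check whether A*B == C, pick a
    # random 0/1 vector r and compare A*(B*r) with C*r -- three matrix-vector
    # products instead of one matrix-matrix product.  With k independent trials
    # the error probability is at most 2**(-k) for fully random r.

    import random

    def mat_vec(M, v):
        return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

    def freivalds(A, B, C, trials=20):
        n = len(A)
        for _ in range(trials):
            r = [random.randint(0, 1) for _ in range(n)]
            if mat_vec(A, mat_vec(B, r)) != mat_vec(C, r):
                return False                 # definitely A*B != C
        return True                          # probably A*B == C

    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C_good = [[19, 22], [43, 50]]
    C_bad  = [[19, 22], [43, 51]]
    print(freivalds(A, B, C_good), freivalds(A, B, C_bad))
    # prints True, then (with overwhelming probability) False
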
TR # 92-12-02
``Asynchronous Epoch Management in Replicated Databases''
Michael Rabinovich and Edward D. Lazowska

The paper describes a new dynamic protocol for managing replicated data. Like the existing dynamic schemes, our protocol involves reconfiguring the system dynamically to reflect failures and repairs as they occur, so that the data may be kept available for user operations even if only one replica of the data remains accessible. However, in the existing schemes, it is required for the correctness of the protocol that the system reconfiguration either run as a special transaction that competes for locks with user operations and thus can interfere with and delay the latter, or be done within the protocol for the write operation, which increases the cost of writes and makes fault tolerance of the system dependent on the rate of write operations. In contrast, our protocol allows system reconfiguration to be done asynchronously with and separately from read and write operations. Therefore, in our protocol, user operations can execute any time while system reconfiguration is in progress, with no interference. At the same time, the cost of write operations remains low and the fault tolerance of the system does not depend on the rate of writes.

TR # 92-12-03
``A High-Speed Channel Controller for the Chaos Router''
Robert Wille

The channel controller in the chaos router requires that all the routers in a network be in phase with each other. Furthermore, data that is sent from one chip must be received by another within a single cycle. These requirements cause the interconnect delay to effectively limit the speed at which the router can operate. This report discusses the design and implementation of a phase-adjusting controller which eliminates this limiting factor. The phase-adjusting controller is able to pipeline data on the interconnect and resynchronize it as it is received. Therefore, the data is not required to be received within a single cycle and the interconnect ceases to be the speed-limiting factor. The phase-adjusting controller is able to increase the bandwidth by up to 40 percent while decreasing the latency by up to 32 percent. In addition, the phase-adjusting controller is tolerant of clock skew, eliminating the need for phase-locked loops to keep the clocks in phase with each other.

TR # 92-12-04
``The Portability of Parallel Programs Across MIMD Computers''
Calvin Lin

Parallel processing is important to the future of high-performance computing. At present, the main impediment to parallel computing is the high cost of parallel software. These costs can be reduced by creating portable programs. But while there have been many proposed approaches to portable parallel programming, none has yet been demonstrated to be effective. This thesis tests the effectiveness of one particular approach for MIMD computers, consisting of the CTA machine model, the Phase Abstractions programming model, and the Orca C programming language. The CTA defines the algorithm designer's view of a machine, the Phase Abstractions provide an abstraction of the CTA that guides the construction of programs, and Orca C provides a language for expressing the programmer's solutions.

We begin by presenting an experimental study that compares the performance of shared memory and non-shared memory programs on shared memory machines. These results show that the non-shared memory programming model, which is based on the CTA machine model, has wide applicability. We then study a large scientific benchmark, showing that a Phase Abstractions implementation of SIMPLE is portable across a variety of MIMD computers. We also use this example to describe the Orca C language and show how it leads to flexible programs that can adapt to different hardware environments through parameterization. Because of the huge diversity of parallel architectures, this flexibility is crucial to supporting portability. Finally, we consider two "real" applications, matrix multiplication and the Modified Gram-Schmidt method of solving QR factorization. Because these studies involve algorithm design, they allow us to indirectly observe the role that the CTA machine model plays in achieving portability. We develop machine-independent algorithms for these two applications and present performance results for various parallel computers.

TR # 92-12-05
``Relaxed Consistency and Synchronization in Parallel Processors''
Rick Zucker

Parallel programs often do not obtain close to linear speed-up when compared to a sequential version of the program running on a uniprocessor. There are many reasons that linear speed-up is not obtained. Two important ones are the overhead of synchronization and memory latency. Synchronization, the coordination of the work done by different processors, is an overhead that does not exist in uniprocessor programs. Therefore, excessive time spent performing synchronization leads to a loss of performance.
Many previous studies to evaluate this overhead have used artificial benchmarks with high levels of lock contention. In this dissertation I study both the effects of synchronization on the performance of real parallel programs and the impact of the efficiency of the implementation of the synchronization algorithm. The results show that the frequency of synchronization is the most significant factor leading to performance loss. When synchronization occurs sufficiently often, the implementation algorithm has a non-negligible effect.

Memory latency, the length of time from when a request to memory is initiated until it completes, is a major problem in multiprocessors. Many hardware and software enhancements have been proposed to deal with the problem. One of the ideas is relaxed models of memory consistency. Relaxed models, such as weak ordering or release consistency, replace sequential consistency, the usual intuitive model of how the memory of the system is implemented. With this change in the memory model, many architectural features can now be used that are not allowed under sequential consistency, but at the cost of imposing constraints on the programmer of parallel systems. In this dissertation I consider many of these architectural features, such as bypassing, lockup-free caches, and a software-controlled cache coherence scheme that I propose. I attempt to determine the performance benefits of using such features and which features provide the most benefit. The results show that relaxed consistency can provide significant performance gains for some programs and architectures. The choice of a given relaxed model does not significantly affect the gains. Software-controlled cache coherence, a scheme that requires a smaller hardware investment, can provide equivalent performance in some cases and competitive performance in others.

TR # 92-12-06
``A Direct Version of Shamir and Snir's Lower Bound''
Prasoon Tiwari and Martin Tompa

We present direct proofs of the following results of Shamir and Snir [Mathematical System Theory, 13, 301-322, 1980] on the depth of monotone arithmetic circuits over rings of characteristic 0: (1) an Omega((log p)(log n)) lower bound for computing the product of p n x n matrices; and (2) an Omega(n) lower bound for computing the permanent of an n x n matrix.

TR # 92-12-07
``Reordering Iterations in Runtime Loop Parallelization''
Shun-Tak Leung and John Zahorjan

When a loop in a sequential program is parallelized, it is normally guaranteed that all flow dependencies and anti-dependencies are respected so that the result of parallel execution is always the same as sequential execution. In some cases, however, the algorithm implemented by the loop allows the iterations to be executed in a different sequential order than the one specified in the program. This opportunity can be exploited to expose parallelism that exists in the algorithm but is obscured by its sequential program implementation. In this paper, we show how parallelization of this kind of loop can be integrated into the runtime parallelization scheme of Saltz et al. [17, 18].
Runtime parallelization is a general technique appropriate for loops whose dependency structures cannot be determined at compile time. The compiler generates two pieces of code: the inspector examines dependencies at run time and computes a parallel schedule; the executor executes iterations in parallel according to the computed schedule. In our case, the inspector has to solve two problems: choosing an appropriate sequential order for the iterations and computing a parallel schedule. The two problems are treated as a single graph coloring problem, which is solved heuristically. Two methods to do so are described. Furthermore, the basic runtime parallelization scheme for shared-memory multiprocessors pays no attention to locality when making these decisions. The performance implications of reordering are examined experimentally on a KSR1 parallel machine as well as through a simple analytic model of execution time.

TR # 92-12-08
``Theory and Practice of Vector Quantizers Trained on Small Training Sets''
David Cohn, Eve A. Riskin, Richard Ladner

We examine how the performance of a memoryless vector quantizer changes as a function of its training set size. Specifically, we study how well the training set distortion predicts test distortion when the training set is a randomly drawn subset of blocks from the test or training image(s). Using the Vapnik-Chervonenkis dimension, we derive formal bounds for the difference of test and training distortion of vector quantizer codebooks. We then describe extensive empirical simulations that test these bounds for a variety of bit rates and vector dimensions, and give practical suggestions for determining the training set size necessary to achieve good generalization from a codebook. We conclude that, by using training sets consisting of only a small fraction of the available data, one can produce results that are close to the results obtainable when all available data are used.
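
The quantity under study can be made concrete with a small sketch (invented synthetic data, not the image experiments in the report): design a codebook on a random subset of vectors and compare its distortion on that subset with its distortion over all available vectors.

    # Sketch of the measurement in question (invented data, not the report's
    # experiments): design a codebook on a small random subset with a few
    # generalized-Lloyd iterations, then compare training and overall distortion.

    import random

    def dist2(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y))

    def nearest(codebook, x):
        return min(codebook, key=lambda c: dist2(c, x))

    def train_codebook(train, k, iterations=10):
        codebook = random.sample(train, k)
        for _ in range(iterations):
            cells = {i: [] for i in range(k)}
            for x in train:
                i = min(range(k), key=lambda j: dist2(codebook[j], x))
                cells[i].append(x)
            for i, members in cells.items():
                if members:                     # move codeword to cell centroid
                    codebook[i] = [sum(col) / len(members) for col in zip(*members)]
        return codebook

    def distortion(codebook, data):
        return sum(dist2(nearest(codebook, x), x) for x in data) / len(data)

    random.seed(0)
    data = [[random.gauss(0, 1) for _ in range(4)] for _ in range(2000)]
    train = random.sample(data, 100)            # small training set
    cb = train_codebook(train, k=8)
    print("train distortion:", distortion(cb, train))
    print("test  distortion:", distortion(cb, data))   # typically noticeably higher
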