Fall 1992 Technical Report Abstracts
From 92-06-01 To 92-12-08
Computer Science and Engineering
University of Washington

TR # 92-06-01
``The Kheystone Benchmark for Parallel Performance Prediction''
Calvin Lin and Lawrence Snyder

The Kheystone is a simple parameterized program that is used to characterize the execution of other, more complex parallel programs. By using information from Kheystone running on one machine, along with two values describing the computation and communication times for a new machine, the program's performance on the new machine can be predicted. This methodology and the Kheystone benchmark are introduced by using the Modified Gram-Schmidt method for QR factorization as a sample application. The characterizations of the sample program are given for two source machines, the Sequent Symmetry and the Intel iPSC/2, and these are used to predict the behavior of the original QR program on five machines: the Symmetry, the nCUBE/7, two iPSC/2's, and the BBN Butterfly. These initial results indicate that the method can be surprisingly effective.

TR # 92-06-02
``Weyl: A Language for Computer Graphics and Computer Aided Geometric Design''
Austin Dahl

The concepts and notation of affine geometry are useful for research in computer graphics and computer aided geometric design. However, the notation is lost when algorithms are moved to an implementation language. In this thesis we present a language whose syntax matches the notation of computer graphics and computer aided geometric design more closely than traditional languages. The language also provides built-in types for affine geometry, supports curves and surfaces with polynomial and blossom data types, is interactive, and provides built-in graphics primitives, which are syntactically lightweight. The product is a powerful, elegant language that allows the user to type geometric expressions and instantly see the results displayed graphically.

TR # 92-06-03
``Reducing Memory Latency via Non-Blocking and Prefetching Caches''
Tien-Fu Chen and Jean-Loup Baer

Non-blocking caches and prefetching caches are two techniques for hiding memory latency by exploiting the overlap of processor computations with data accesses. A non-blocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed, thus exploiting post-miss operations. A prefetching cache generates prefetch requests to bring data into the cache before it is actually needed, thus allowing overlap with pre-miss computations. In this paper, we evaluate the effectiveness of these two hardware-based schemes. We propose a hybrid design based on the combination of these approaches. We also consider compiler-based optimizations to enhance the effectiveness of non-blocking caches. Results from instruction-level simulations on the SPEC benchmarks show that the hardware prefetching caches generally outperform non-blocking caches. Also, the relative effectiveness of non-blocking caches is more adversely affected by an increase in memory latency than that of prefetching caches. However, the performance of non-blocking caches can be improved substantially by compiler optimizations such as instruction scheduling and register renaming. The hybrid design can be very effective in reducing the memory latency penalty for many applications.

TR # 92-06-04
``The Virtual System Model: A Scalable Approach to Organizing Large Systems''
Clifford Neuman

Naming is critical in distributed systems. Names identify files, services, processors, and users.
A name service translates the names of objects to the information needed to access those objects. The growth of distributed systems brings with it an increase in the number of objects to be named. Existing naming techniques are derived from techniques used on centralized systems and are not sufficient for organizing the large number of objects that are becoming available.

This dissertation presents the Virtual System Model, a new model for naming that helps users organize the information available in large systems. The Virtual System Model has four principal features: support for customizable name spaces, tools to help users construct name spaces, support for synonyms, and a method for selecting the appropriate name space when resolving names. In the Virtual System Model users create customized views of the system, called virtual systems, using tools that allow new views to be specified as functions of other, possibly changing, views. In a customized view, objects that users frequently use have short names while those that are never used are hidden from view. A problem that arises from customized naming is the lack of name transparency: the same name may refer to different objects when used in different name spaces. This problem is addressed by supporting closure: each object has an associated name space, and that name space is used to resolve names that are specified by the object.

The dissertation begins by discussing the naming mechanisms used in existing systems and by highlighting the problems that arise as systems grow. The Virtual System Model is described, and its differences from existing models of naming are discussed. The application of the Virtual System Model to distributed systems is described, including its use in organizing and searching for information, selecting processors and services, and specifying security requirements. A prototype file system named Prospero demonstrates the usefulness of the model. The prototype is used on more than 7,500 systems in 29 countries to organize and explore information available from Internet archive sites worldwide. The design, implementation, and use of the prototype are described.

TR # 92-07-01
``Multi-Garnet: Integrating Multi-Way Constraints with Garnet''
Michael Sannella and Alan Borning

Constraints provide a useful mechanism for maintaining relations in user interface toolkits. Garnet is a widely-used user interface toolkit with considerable functionality, based on one-way, required constraints. Multi-Garnet extends Garnet by adding support for multi-way constraints and constraint hierarchies with both required and preferential constraints. This document contains three chapters describing Multi-Garnet:

Chapter 1 presents a high-level overview of Multi-Garnet. To motivate the development of Multi-Garnet, we examine the Garnet constraint system, present some realistic user interface problems that are difficult to handle in Garnet, and demonstrate how Multi-Garnet addresses these problems. We provide details on how Multi-Garnet supports some of the features of Garnet, including constraints with pointer variables and inheritance of constraints.

Chapter 2 contains a reference manual for the current version of Multi-Garnet (version 2.1). This includes information on compiling and loading Multi-Garnet, as well as documentation for the functions and macros used to create and manipulate Multi-Garnet constraints. This chapter also contains additional details on the implementation of Multi-Garnet.

Chapter 3 describes a large Multi-Garnet example: a scatterplot displaying a set of points. Multi-Garnet constraints are used to maintain relationships between the data values, the screen positions of the points, and the positions of the X- and Y-axes. Multiple interaction modes allow manipulating the scatterplot points and axes in different ways.
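
For readers unfamiliar with the distinction, a multi-way constraint can be satisfied by recomputing any one of its variables, whereas a one-way constraint always recomputes the same output. The sketch below (plain Python with invented class and variable names; not Multi-Garnet or DeltaBlue code) maintains a Celsius/Fahrenheit relation by running whichever method leaves the most recently set variable alone.

    # Illustrative sketch only -- not Multi-Garnet or DeltaBlue code.
    # A multi-way constraint stores one method per output variable and
    # re-satisfies itself by running a method that does not overwrite
    # the variable the user just set.

    class MultiWayConstraint:
        def __init__(self, variables, methods):
            self.variables = variables        # dict: name -> value
            self.methods = methods            # dict: output name -> function(vars)

        def set(self, name, value):
            """Change one variable, then re-enforce the relation by
            recomputing some *other* variable."""
            self.variables[name] = value
            for output, method in self.methods.items():
                if output != name:            # pick a method that leaves 'name' alone
                    self.variables[output] = method(self.variables)
                    return

    # A temperature relation: f = c * 9/5 + 32, usable in either direction.
    temp = MultiWayConstraint(
        {"c": 0.0, "f": 32.0},
        {"f": lambda v: v["c"] * 9.0 / 5.0 + 32.0,
         "c": lambda v: (v["f"] - 32.0) * 5.0 / 9.0})

    temp.set("c", 100.0)     # drives f to 212.0
    temp.set("f", 32.0)      # drives c back to 0.0
    print(temp.variables)    # -> {'c': 0.0, 'f': 32.0}
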
TR # 92-07-04
``An Efficient and Highly Available Read-One Write-All Protocol for Replicated Data''
Ed Lazowska and Michael Rabinovich

A new read-one write-all (ROWA) protocol for replicated data is proposed that allows a system to adjust to failures dynamically in order to keep the data available. If failures arrive mostly sequentially, our protocol keeps the data available as long as there is at least one operational replica. This is achieved by making the epoch mechanism, previously applicable to non-ROWA schemes only, usable within the ROWA discipline. Also, adjusting the system to a new configuration in our protocol is done completely asynchronously with reads and writes. In contrast, in the existing dynamic schemes (both ROWA and non-ROWA), system reconfiguration may interfere with and delay user transactions.

TR # 92-07-05
``Multi-way versus One-way Constraints in User Interfaces: Experience with the DeltaBlue Algorithm''
Michael Sannella, Bjorn Freeman-Benson, John Maloney, and Alan Borning

The efficient satisfaction of constraints is essential to the performance of constraint-based user interfaces. In the past, most constraint-based user interfaces have used one-way rather than multi-way constraints because of a widespread belief that one-way constraints were more efficient. In this paper we argue that many user interface construction problems are handled more naturally and elegantly by multi-way constraints than by one-way constraints. We present pseudocode for an incremental multi-way constraint satisfaction algorithm, DeltaBlue, and describe experience in using the algorithm in two user interface toolkits. Finally, we provide performance figures demonstrating that multi-way constraint solvers can be entirely competitive in performance with one-way constraint solvers.

TR # 92-07-07
``Non-Uniformities Introduced By Virtual Channel Deadlock''
Kevin Bolding

A common scheme for preventing deadlock in networks is the virtual channel method of Dally and Seitz. Due to the nature of this scheme, an otherwise completely uniform network will have non-uniformities introduced into it. The variations introduce several effects, ranging from limitations on overall network performance to differences in observed network characteristics from node to node and from message to message.

TR # 92-07-09
``Beyond the Naivety of Grades: Educational Record Keeping for the Twenty-First Century''
Steven L. Tanimoto

At its core, education is an information-based activity: teaching is usually thought of as the presentation of information, and learning as the acquisition of information. Yet modern information-management methodologies have been sorely lacking in US public education. For example, the primary record of student progress is the transcript of grades, a practice from the late nineteenth century that is now thoroughly entrenched as a major tenet of the academic bureaucracy.
Today, the needs for recorded information about students' progress and learning styles are much greater, not only because of today's diversity of student backgrounds, but also because the goals of learning themselves are more diverse, and the educational environment is more decentralized and uncoordinated than ever before. Accurate and detailed information is needed not only about each student's mastery of concepts, vocabulary, problem-solving, and other knowledge but also about instructional approaches known to be successful with the student and about extracurricular (e.g., family) problems the student is confronting; in short, information about anything that ultimately affects the learning process. Management is impossible without information. While it may be possible to make a few incremental improvements to education by "throwing technology at the classroom," major improvements will not happen until the current approach to education is reorganized. Key features of a workable solution are the elimination of grades as we know them, the integration of detailed monitoring and management of each student's progress, and restructuring the record-keeping system in order to instill in each student a new attitude of enthusiastic acceptance of responsibility for his or her own education.

TR # 92-08-01
``Papers Presented at the Microelectronic System Education Conference 1989-91''
Gaetano Borriello, Carl Ebeling, Larry McMurchie, Neil McKenzie

The Microelectronic System Education Conference and Exhibition has been held each summer since 1988 to bring together educators dedicated to furthering undergraduate and graduate education in designing and building innovative microelectronic systems. The conference emphasizes the impact of the latest developments in research (CAD, software methodologies, tools and technologies for system building) on university education. This report collects the papers we have presented at this conference over the three years from 1989 to 1991. They address aspects of both our undergraduate and graduate curricula in digital system design.

TR # 92-08-02
``Predicting Deterministic Execution Times of Real-Time Programs''
C. Y. Park

The most distinctive characteristic of a real-time system is that it requires temporal correctness. Therefore, analyzing the timing behavior of a real-time system is essential. In this dissertation, we address this problem by predicting the deterministic execution times of the programs comprising a system. Our approach is to predict a time interval, covering all execution cases, with an analytic method at the source program level. The basic prediction method is based on a formal timing schema for a source-level construct. A simple and efficient timing tool for C programs is implemented in a practical setting. Comparison with measurements shows that predictions are safe and usually tight. However, the tool produces loose predictions for some complex programs because of infeasible paths.

For tighter predictions, we analyze the dynamic behavior of a program using information provided by a user. We introduce a formal path model where both a program and user execution information are represented by a set of program paths described by an extended regular expression. Infeasible paths are eliminated by intersecting the two path sets. With a well-defined practical high-level interface language, the path model can be used in an easy and efficient way. We also introduce a method to verify given user information with known program verification techniques. Extended with path analysis, a timing tool yields safe and accurate predictions for a wide range of programs. It also supports incremental refinement where overall prediction cost is scalable with respect to desired precision.

As a start for deterministic timing analysis of input/output operations, we introduce a method to predict the execution time of an I/O statement. The main problems in I/O analysis are predicting waiting time for a device or data and analyzing interference caused by asynchronous execution, in a variety of implementation policies. Our approach is to develop a general framework and to apply it to a specific case in a systematic way. We provide a schema refinement rule for each implementation policy that transforms a timing schema to be more specific and thus more predictable in a target system. We also introduce formulas that compute the effect of interference in a target system.

We developed methods to predict deterministic program execution times, and validated them by experiments. Predictions are safe, accurate and feasible with reasonable cost. Future work includes experimenting with large programs, analyzing concurrent programs, and reasoning about the timing properties of a system with predictions.
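
The schema-based prediction style can be illustrated with a small sketch (the miniature AST encoding and the constant costs below are invented for illustration; this is not the tool described above): each source construct yields a best/worst execution-time interval, composed recursively.

    # Minimal illustration of a timing schema: each source construct gets an
    # interval [best, worst], composed recursively.  Costs and AST are invented.

    def bound(node):
        kind = node[0]
        if kind == "stmt":                       # ("stmt", best, worst)
            return (node[1], node[2])
        if kind == "seq":                        # ("seq", n1, n2)
            b1, w1 = bound(node[1])
            b2, w2 = bound(node[2])
            return (b1 + b2, w1 + w2)
        if kind == "if":                         # ("if", cond, then, else)
            bc, wc = bound(node[1])
            bt, wt = bound(node[2])
            be, we = bound(node[3])
            return (bc + min(bt, be), wc + max(wt, we))
        if kind == "loop":                       # ("loop", cond, body, max_iters)
            bc, wc = bound(node[1])
            bb, wb = bound(node[2])
            n = node[3]
            return (bc, (wc + wb) * n + wc)      # 0..n iterations, plus final test
        raise ValueError(kind)

    prog = ("seq",
            ("loop", ("stmt", 1, 1), ("stmt", 4, 9), 10),
            ("if", ("stmt", 1, 2), ("stmt", 5, 5), ("stmt", 3, 8)))
    print(bound(prog))        # -> (5, 111): a best/worst execution-time interval
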
TR # 92-08-03
``Adding Scheduler Activations to Mach 3.0''
Paul Barton-Davis, Dylan McNamee, Raj Vaswani, and Edward D. Lazowska

When user-level threads are built on top of traditional kernel threads, they can exhibit poor performance or even incorrect behavior in the face of blocking kernel operations such as I/O, page faults, and processor preemption. This problem can be solved by building user-level threads on top of a new kernel entity, the scheduler activation. The goal of the effort described in this paper was to implement scheduler activations in the Mach 3.0 operating system. This is an "implementation" paper rather than a "concepts" paper: we outline the design decisions made, the kernel modifications required, and our additions to the CThreads thread library to take advantage of the new kernel structure.

TR # 92-09-01
``Double Exponential Inseparability of Robinson Subsystem Q_T From the Logically False Formulas in the Language of Addition''
Giovanni Faglia and Paul Young

In this work we prove a double exponential time inseparability result for a finitely axiomatizable theory that is weaker than ADDAX. Every set that separates Q_T from the logically false sentences of addition is not recognizable by any Turing machine working in double exponential time. This also implies that any theory of addition that is consistent with Q_T, in particular any theory contained in Pr, is at least double exponential time difficult.

TR # 92-09-03
``Monitoring Timing Constraints in Distributed Real-time Systems''
Sitaram C. V. Raju, Ragunathan Rajkumar, Farnam Jahanian

In distributed real-time systems, the unpredictability of interactions with the external environment and violations of design assumptions at run-time can lead to unexpected conditions such as system overload and the missing of deadlines. It is highly desirable that these error conditions do not lead to total system failure and that the critical functions of the system are still performed. In this paper, we describe a run-time facility for monitoring distributed real-time systems. In particular, we focus on the problem of detecting violations of timing assertions in a distributed environment in which the real-time tasks run on multiple processors, and timing constraints can be inter-processor or intra-processor constraints. Constraint violations are detected at the earliest possible time by deriving and checking intermediate constraints from the user-specified constraints. If the violations must be detected as early as possible, the problem of minimizing the number of messages to be exchanged between the processors becomes intractable. We then characterize a sub-class of timing constraints which occur commonly in distributed real-time systems and whose message requirements are minimized easily. We also take into account the drift among the various processor clocks when detecting a violation of a timing assertion. We finally describe a prototype implementation of our distributed run-time monitor on a network of IBM RS/6000 workstations.
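
A trace-based check of a simple deadline assertion can be sketched as follows (invented event names and data; the report's assertion notation is based on Real-Time Logic and handles far more general constraints than this):

    # Illustrative deadline check over a time-stamped event trace (invented data).
    # Constraint: the i-th "reply" must occur within 50 time units of the
    # i-th "request", allowing a bound eps on inter-processor clock drift.

    def check_deadline(trace, start, end, deadline, eps=0.0):
        starts = [t for (e, t) in trace if e == start]
        ends   = [t for (e, t) in trace if e == end]
        violations = []
        for i, (ts, te) in enumerate(zip(starts, ends)):
            if te - ts > deadline + eps:      # tolerate up to eps of clock drift
                violations.append((i, ts, te))
        return violations

    trace = [("request", 0), ("reply", 30), ("request", 100), ("reply", 180)]
    print(check_deadline(trace, "request", "reply", deadline=50, eps=2))
    # -> [(1, 100, 180)]: the second request/reply pair misses its deadline
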
TR # 92-09-04
``The Systematic Plan Adaptor: A Formal Foundation for Case-Based Planning''
Steven Hanks and Daniel S. Weld

Research efforts in case-based reasoning have made advances in various problem domains, but general principles and domain-independent algorithms have been slower to emerge. We explore the theoretical foundations of case-based planning, in particular characterizing the fundamental tradeoffs that govern the process of plan adaptation. To do so we view the planning process as search through a graph of partial plans: plan generation starts at the graph's root and adds constraints, while plan adaptation starts at an arbitrary place in the graph and can either add or delete constraints. This paper develops a domain-independent algorithm for case-based plan adaptation, SPA, that is sound, complete, and systematic. Since our algorithm is sound, every plan it returns is guaranteed to achieve the goal. Since our algorithm is complete, it is guaranteed to find a plan if one exists, regardless of the initial plan returned from the library. Systematicity guarantees that our algorithm explores the space of partially specified, incomplete plans in a manner which avoids redundant or wasted effort. This paper describes the algorithm and its implementation in some detail, shows how our framework sheds light on existing heuristic, transformational planners, and presents experimental results that demonstrate a systematic relationship between library reuse and computational savings in a simple domain.

TR # 92-09-06
``Implementations and Variations of a Maximum-Flow Algorithm''
Joao C. Setubal

We study the push-relabel algorithm for the maximum flow problem and present a number of novel results chiefly concerning its performance in practice. We show that an implementation of the push-relabel algorithm using the global relabeling heuristic is by far the fastest for the sparse graphs tested, when compared to four other algorithm implementations. These results show that the push-relabel algorithm is even better in practice than previous studies suggested. We present a new modification of the push-relabel algorithm that allows its efficient implementation in parallel. The modification allows global relabeling to be executed concurrently with push and relabel operations. This technique, combined with special data structures that enable a limited ability for parallel granularity control, results in a parallel program that achieves good speed-ups over the sequential implementation on a shared-memory multiprocessor. The main code for this implementation is presented in an appendix.

We extend both sequential and parallel results to the bipartite matching problem. We show that a specialized implementation of the push-relabel algorithm for this problem is competitive with or faster than implementations of two other algorithms that have better worst-case time bounds. A parallel implementation of the push-relabel algorithm for the bipartite matching problem based on the techniques mentioned above was done and achieved modest speed-ups over the sequential implementation. Although the speed-ups are small, they are significant since the existing parallel algorithms for bipartite matching seem to be quite impractical. Finally, we present a new variation of the push-relabel algorithm that allows positive or negative excess flow in active vertices. Although the running time bounds are the same as in the basic push-relabel algorithm, this variation is more versatile, being particularly useful for solving general instances of the dynamic maximum flow problem in practice.
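
For reference, a compact textbook version of the basic push-relabel algorithm is sketched below (FIFO order, no global-relabeling heuristic, although the report finds that heuristic essential for speed in practice; the example graph is invented):

    # Textbook FIFO push-relabel maximum flow (no global-relabeling heuristic).
    # The graph is a capacity matrix; s = source, t = sink.

    from collections import deque

    def max_flow(cap, s, t):
        n = len(cap)
        flow = [[0] * n for _ in range(n)]
        height = [0] * n
        excess = [0] * n
        height[s] = n
        for v in range(n):                    # saturate all edges leaving the source
            flow[s][v] = cap[s][v]
            flow[v][s] = -cap[s][v]
            excess[v] = cap[s][v]
        active = deque(v for v in range(n) if v not in (s, t) and excess[v] > 0)
        while active:
            u = active[0]
            pushed = False
            for v in range(n):
                residual = cap[u][v] - flow[u][v]
                if residual > 0 and height[u] == height[v] + 1:
                    delta = min(excess[u], residual)      # push
                    flow[u][v] += delta
                    flow[v][u] -= delta
                    excess[u] -= delta
                    excess[v] += delta
                    if v not in (s, t) and v not in active:
                        active.append(v)
                    pushed = True
                    if excess[u] == 0:
                        break
            if excess[u] == 0:
                active.popleft()
            elif not pushed:
                # Relabel: lift u just above its lowest residual neighbor.
                height[u] = 1 + min(height[v] for v in range(n)
                                    if cap[u][v] - flow[u][v] > 0)
        return sum(flow[s][v] for v in range(n))

    cap = [[0, 10, 10, 0],
           [0, 0, 5, 10],
           [0, 0, 0, 10],
           [0, 0, 0, 0]]
    print(max_flow(cap, 0, 3))   # expected maximum flow: 20
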
TR # 92-09-07
``Improving the Performance of Message-Passing Applications by Multithreading''
Edward W. Felten and Dylan McNamee

Achieving maximum performance in message-passing programs requires that calculation and communication be overlapped. However, the program transformations required to achieve this overlap are error-prone and add significant complexity to the application program. We argue that calculation/communication overlap can be achieved easily and consistently by executing multiple threads of control on each processor, and that this approach is practical on message-passing architectures without any special hardware support. We present timing data for a typical message-passing application to demonstrate the advantage of our scheme.

TR # 92-09-08
``Techniques for File System Simulation''
Chandramohan A. Thekkath, John Wilkes, and Edward D. Lazowska

The design of file and disk systems is an extremely active area. This paper describes a collection of techniques for performing careful simulation-based evaluations of such systems. These techniques have novel aspects in the following areas: workload characterization, file system modeling, and disk behavior representation. They make feasible the detailed simulation of new I/O hardware and file system software, and of extensions to existing designs. In particular, using the techniques described here is likely to make comparative file system studies more accurate. In addition to these specific contributions, the paper makes two broader points. First, it argues that detailed simulations are appropriate and necessary in many cases. Second, it demonstrates that detailed models need not be difficult or time-consuming to construct.

TR # 92-10-01
``Generalized B-spline Surfaces of Arbitrary Topological Type''
Charles Teorell Loop

B-spline surfaces, although widely used, are incapable of describing surfaces of arbitrary topological type. It is not possible to model a general closed surface or a surface with handles as a single non-degenerate B-spline. In practice such surfaces are often needed. In this thesis, a generalization of bicubic tensor product and quartic triangular B-spline surfaces is presented that is capable of capturing surfaces of arbitrary topological type. These results are obtained by relaxing the sufficient but not necessary smoothness constraints imposed by B-splines and through the use of an n-sided generalization of Bezier surfaces called S-patches.
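
As background for the tensor-product constructions mentioned above, the following sketch (not code from the thesis) evaluates an ordinary bicubic Bezier patch by de Casteljau interpolation over the rows and then over the resulting column:

    # Background sketch (not code from the thesis): evaluating a bicubic
    # tensor-product Bezier patch by repeated de Casteljau interpolation.

    def de_casteljau(points, t):
        """Evaluate a Bezier curve with the given control points at parameter t."""
        pts = [p[:] for p in points]
        while len(pts) > 1:
            pts = [[(1 - t) * a + t * b for a, b in zip(p, q)]
                   for p, q in zip(pts, pts[1:])]
        return pts[0]

    def bezier_patch(net, u, v):
        """net is a 4x4 grid of 3-D control points; returns the surface point."""
        column = [de_casteljau(row, u) for row in net]
        return de_casteljau(column, v)

    # A flat 4x4 control net (z = 0), for a quick check.
    net = [[[float(i), float(j), 0.0] for i in range(4)] for j in range(4)]
    print(bezier_patch(net, 0.5, 0.5))   # -> [1.5, 1.5, 0.0], the centre of the net
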
TR # 92-10-02
``Time-Space Tradeoffs for Graph s-t Connectivity''
Gregory Barnes

The problem of graph s-t connectivity, determining whether two vertices s and t in a graph are in the same connected component, is a fundamental problem in computational complexity theory. Determining the space complexity of s-t connectivity either for directed graphs (stcon) or for undirected graphs (ustcon) would tell us a great deal about the relationships among deterministic, nondeterministic, and probabilistic logarithmic space bounded complexity classes. A fruitful intermediate step to determining the space complexity of stcon and ustcon is to explore time-space tradeoffs for the problems: the simultaneous time and space requirements of algorithms for connectivity. Prior to this work, all deterministic connectivity algorithms that used less than linear space (the space bound for well-known algorithms such as breadth-first and depth-first search) required superpolynomial time (e.g. Savitch's algorithm, which uses O(log^2 n) space and n^(O(log n)) time). In this thesis, we explore various techniques for creating small-space, but polynomial-time algorithms for undirected and directed graph connectivity. We present sublinear-space, polynomial-time deterministic algorithms for both problems. No previous such algorithms were known for these problems; in fact, there was some evidence that such algorithms could not exist for directed graphs. We present a subproblem, the short paths problem, that could separate stcon and ustcon: we present evidence that finding the length of a simple path is intrinsic to solving stcon, but not ustcon, in small space.

TR # 92-10-03
``A Prototyping Environment for Specifying, Executing and Checking Communicating Real-time State Machines''
Sitaram C. V. Raju and Alan C. Shaw

We describe a toolset, consisting of a graphical editor, a simulator, and an assertion checker, for prototyping distributed real-time systems that are specified as Communicating Real-time State Machines (CRSMs). CRSMs are timed state machines that communicate synchronously over uni-directional channels. The system behavior of CRSMs is characterized by a time-stamped trace of communication events. Safety and timing assertions on the trace of communication events are expressed in a notation based on Real-Time Logic. We illustrate the novel aspects of the simulator and assertion checker by specifying a traffic-light controller and other real-time systems.

TR # 92-10-04
``A Multi-Level Hierarchical Cache Coherence Protocol for Multiprocessors''
Craig Anderson and Jean-Loup Baer

In order to meet the computational needs of the next decade, shared-memory multiprocessors must be scalable. Though single shared-bus architectures have been successful in the past, limited bus bandwidth restricts the number of processors that can be effectively put on a single-bus machine. One architecture that has been proposed to solve the limited bandwidth problem connects processors using a tree hierarchy of buses. In this paper we present a tool to study a hierarchical bus-based shared-memory system. We show that there is a need for transitional states in the cache coherence protocol that are not necessary in a single-bus architecture. We also give several examples of the protocol in action. Finally, we conclude with some preliminary results, and some examples of how the protocol and architecture could be made more efficient.
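
Why transitional states arise can be illustrated with a toy line-state machine (a generic sketch, not the protocol in the report): after a cache asks for exclusive ownership, the line waits in a transient state until the reply filters back down the hierarchy, and must still react to bus traffic observed in the meantime.

    # Toy illustration of transitional states (not the report's protocol): a line
    # that has requested exclusive ownership waits in a transient state until the
    # reply propagates back down the bus hierarchy.

    INVALID, SHARED, MODIFIED, SHARED_TO_MOD = "I", "S", "M", "S->M (waiting)"

    def next_state(state, event):
        table = {
            (INVALID,       "proc_read"):      SHARED,
            (SHARED,        "proc_write"):     SHARED_TO_MOD,  # request sent upward
            (SHARED_TO_MOD, "ack"):            MODIFIED,       # reply came back
            (SHARED_TO_MOD, "bus_invalidate"): INVALID,        # lost the race meanwhile
            (SHARED,        "bus_invalidate"): INVALID,
            (MODIFIED,      "bus_read"):       SHARED,         # supply data, downgrade
        }
        return table.get((state, event), state)

    state = INVALID
    for event in ["proc_read", "proc_write", "ack"]:
        state = next_state(state, event)
        print(event, "->", state)
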
TR # 92-10-05
``Improving the Performance of Runtime Parallelization''
Shun-tak Leung and John Zahorjan

When the inter-iteration dependency pattern of the iterations of a loop cannot be determined statically, compile-time parallelization of the loop is not possible. In these cases, runtime parallelization [8] is the only alternative. The idea is to transform the loop into two code fragments: the inspector and the executor. When the program is run, the inspector examines the iteration dependencies and constructs a parallel schedule. The executor subsequently uses that schedule to carry out the actual computation in parallel. In this paper, we show how to reduce the overhead of running the inspector through its parallel execution. We describe two related approaches. The first, which emphasizes inspector efficiency, achieves nearly linear speedup relative to a sequential execution of the inspector, but produces a schedule that may be less efficient for the executor. The second technique, which emphasizes executor efficiency, does not in general achieve linear speedup of the inspector, but is guaranteed to produce the best achievable schedule. We present these techniques, show that they are correct, and compare their performance to existing techniques using a set of experiments. Because in this paper we optimize inspector time while leaving the executor unchanged, the techniques we present have the most dramatic effect when the inspector must be run for each invocation of the source loop. In a companion paper, we explore techniques that build upon those developed here to also improve executor performance.
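
The inspector/executor split can be sketched as follows (a sequential toy version with an invented dependence table; the paper's contribution is running the inspector itself in parallel): the inspector assigns each iteration to a wavefront, and the executor runs the wavefronts in order, with the iterations inside a wavefront free to run concurrently.

    # Toy inspector/executor (sequential sketch).  deps[i] lists the earlier
    # iterations that iteration i depends on, e.g. through array subscripts
    # known only at run time.

    def inspector(deps):
        """Assign each iteration to a wavefront: 1 + the latest wavefront
        among the iterations it depends on."""
        wave = {}
        for i in sorted(deps):                       # iterations in program order
            wave[i] = 1 + max((wave[j] for j in deps[i]), default=0)
        schedule = {}
        for i, w in wave.items():
            schedule.setdefault(w, []).append(i)
        return [schedule[w] for w in sorted(schedule)]

    def executor(schedule, body):
        for wavefront in schedule:                   # wavefronts run in sequence
            for i in wavefront:                      # these could run in parallel
                body(i)

    deps = {0: [], 1: [0], 2: [], 3: [1, 2], 4: [2]}
    print(inspector(deps))    # -> [[0, 2], [1, 4], [3]]
    executor(inspector(deps), lambda i: None)
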
TR # 92-10-06
``Counting Protocols for Reliable End-to-End Transmission''
Richard Ladner, Anthony LaMarca, Ewan Tempero

We introduce a new class of protocols, called counting protocols. These protocols provide a reliable virtual circuit for computer networks in which packets may be delivered out of order. Standard protocols use bounded sequence numbers to identify lost packets. In networks that do not remove very old packets, these protocols may experience wrap-around and fail. Counting protocols use a different approach. Three counting protocols are presented: (i) the one-bit protocol, which uses a one-bit header and sends one packet per message under ideal conditions, but performs poorly in networks with realistic loss rates, (ii) the mode protocol, which uses multiple-bit headers and behaves well in networks with realistic loss rates, and (iii) the mode-power protocol, which also uses multiple-bit headers and performs better than the mode protocol on short sequences, but worse on very long sequences. We give the derivations of the three counting protocols and prove the one-bit protocol correct. We do a careful performance analysis of both the one-bit and mode protocols and present the results of a simulation study to demonstrate the performance of the mode-power protocol.

TR # 92-10-07
``Practical Issues in Retiming Latch-Based Circuits''
Brian Lockyear and Carl Ebeling

In this paper we examine some of the practical issues that arise when using retiming to optimize real systems. We first propose that retiming should be an integral part of a practical system design methodology and not just a low-level optimization technique. We then remove some of the practical limitations of retiming by showing how to handle clock skew and latch setup time and propagation delays. Finally, we describe how to use retiming to generate circuits that are as robust as possible to variations in delay and clock skew parameters.

TR # 92-10-08
``The UW MacTester: A Low-Cost Functional Tester for Interactive Testing and Debugging''
Neil McKenzie, Larry McMurchie and Carl Ebeling

We describe a low-cost functional tester that provides a large number of programmable I/O signals for interactively testing and debugging chips, boards and subsystems. Testing a subsystem requires only minimal hardware setup, which allows the tester to be shared by several users and reduces the overall cost of test equipment. Testing is performed under program control which can be driven either by a set of test vectors or by a user-supplied procedural test program. The tester system consists of the tester unit, a parallel interface card and a personal computer as the host. The tester architecture uses Xilinx FPGAs for the pin electronics and static RAM to enable testing of dynamic circuits. A simple but powerful software interface to the tester makes possible a wide variety of testing and debugging environments.

TR # 92-11-01
``Timing Analysis of Concurrent Systems: An Algorithm for Determining the Time Separation of Events''
Tod Amon, Henrik Hulgaard, Gaetano Borriello, Steve Burns

A fundamental problem in the synthesis and optimization of concurrent systems is the determination of the separation time between system events. We present a theoretical framework for solving this problem for arbitrary process graphs without conditional behavior and develop an efficient and exact algorithm based on this theoretical foundation. Examples are used to demonstrate the operation and generality of the algorithm.

TR # 92-11-04
``Message Logging and Data-Parallel Languages''
Alexander C. Klaiber and Henry M. Levy

As modern multiprocessor systems increase in scale, the probability of faults increases to the point where the MTBF is lower than the running time of typical applications. It is therefore imperative that such systems provide a means of restarting computations after system failures. So far, this has usually been the responsibility of the application programmer. In this paper, we review several proposed general-purpose restart mechanisms for multiprocessor systems. In particular, we take a closer look at the class of optimistic message logging schemes and study the I/O overhead incurred by one representative implementation. Our benchmarks demonstrate that this overhead becomes extremely large for applications with high message traffic. Further, much of the logged information is redundant. We then propose an alternative recovery scheme that relies on the fact that parallel programs written in a data-parallel language can be compiled in such a way as to be independent of message arrival order. Our scheme achieves the benefits of message logging, such as fast output commit, at significantly lower overhead.

TR # 92-11-05 (also UCI TR #92-108)
``An Investigation of the Therac-25 Accidents''
Nancy G. Leveson and Clark S. Turner

Risk in any complex technology is unavoidable. However, important lessons can be learned from accidents, and these lessons can be used to design procedures for reducing risk in the future. Although descriptions of the Therac-25 medical electron accelerator accidents have been published previously, they are incomplete and often misleading.
This paper contains a detailed account of these accidents, along with some lessons that can be learned from them in terms of system engineering, software engineering, and government regulation of safety-critical systems involving software.

TR # 92-11-06
``Surface Approximation Using Geometric Hermite Patches''
Stephen Mann

Many surfaces have mathematically complex or computationally expensive representations. For some applications, an approximation to these surfaces is adequate. Numerous tangent plane smooth surface interpolants are reviewed, and inadequacies of these schemes for the approximation of surfaces are investigated. Two high-order-of-approximation surface patches are presented and used to construct continuous, approximating surfaces. The use of these patches is demonstrated for approximating offset surfaces, algebraic surfaces, and S-patches.

TR # 92-11-07
``A Self-Timed Counter II''
Ted Kehl

This report is a new version of a previous technical report which was never issued. Specifically, optimization of the original design resulted in about a 40% speed-up.

TR # 92-11-08
``A Self-Timed Comparand Register''
Ted Kehl

A comparand register is simply a register with additional circuitry arranged such that the contents of the register are compared, for equality, with an input value. This comparand register has two additional features: it is self-timed and has an inverter chain for high-resolution timing.

TR # 92-12-01
``Bounds on Sample Space Size for Matrix Product Verification''
Donald D. Chinn, Rakesh K. Sinha

We show that the size of any sample space that could be used in Freivalds' probabilistic matrix product verification algorithm for n x n matrices is at least (n-1)/epsilon if the error probability is at most epsilon. We also provide a characterization of any sample space for which Freivalds' algorithm has error probability at most epsilon. We then provide a generalization of Freivalds' algorithm and provide matching lower and upper bounds on the error probability in terms of the sample space size and running time.
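
For reference, Freivalds' test itself is easily sketched (the standard algorithm with fully random 0/1 vectors, rather than the restricted sample spaces studied in the report): to check whether AB = C, compare A(Br) with Cr for a random vector r.

    # Freivalds' matrix product verification: to check whether A*B == C, pick a
    # random 0/1 vector r and compare A*(B*r) with C*r -- three matrix-vector
    # products instead of one matrix-matrix product.  With k independent trials
    # the error probability is at most 2**(-k) for fully random r.

    import random

    def mat_vec(M, v):
        return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

    def freivalds(A, B, C, trials=20):
        n = len(A)
        for _ in range(trials):
            r = [random.randint(0, 1) for _ in range(n)]
            if mat_vec(A, mat_vec(B, r)) != mat_vec(C, r):
                return False                 # definitely A*B != C
        return True                          # probably A*B == C

    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C_good = [[19, 22], [43, 50]]
    C_bad  = [[19, 22], [43, 51]]
    print(freivalds(A, B, C_good), freivalds(A, B, C_bad))
    # prints True, then (with overwhelming probability) False
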
TR # 92-12-02
``Asynchronous Epoch Management in Replicated Databases''
Michael Rabinovich and Edward D. Lazowska

The paper describes a new dynamic protocol for managing replicated data. Like the existing dynamic schemes, our protocol involves reconfiguring the system dynamically to reflect failures and repairs as they occur, so that the data may be kept available for user operations even if only one replica of the data remains accessible. However, in the existing schemes, it is required for the correctness of the protocol that the system reconfiguration either run as a special transaction that competes for locks with user operations and thus can interfere with and delay the latter, or be done within the protocol for the write operation, which increases the cost of writes and makes fault tolerance of the system dependent on the rate of write operations. In contrast, our protocol allows system reconfiguration to be done asynchronously with and separately from read and write operations. Therefore, in our protocol, user operations can execute any time while system reconfiguration is in progress, with no interference. At the same time, the cost of write operations remains low and the fault tolerance of the system does not depend on the rate of writes.

TR # 92-12-03
``A High-Speed Channel Controller for the Chaos Router''
Robert Wille

The channel controller in the chaos router requires that all the routers in a network be in phase with each other. Furthermore, data that is sent from one chip must be received by another within a single cycle. These requirements cause the interconnect delay to effectively limit the speed at which the router can operate. This report discusses the design and implementation of a phase-adjusting controller which eliminates this limiting factor. The phase-adjusting controller is able to pipeline data on the interconnect and resynchronize it as it is received. Therefore, the data is not required to be received within a single cycle and the interconnect ceases to be the speed-limiting factor. The phase-adjusting controller is able to increase the bandwidth by up to 40 percent while decreasing the latency by up to 32 percent. In addition, the phase-adjusting controller is tolerant of clock skew, eliminating the need for phase-locked loops to keep the clocks in phase with each other.

TR # 92-12-04
``The Portability of Parallel Programs Across MIMD Computers''
Calvin Lin

Parallel processing is important to the future of high-performance computing. At present, the main impediment to parallel computing is the high cost of parallel software. These costs can be reduced by creating portable programs. But while there have been many proposed approaches to portable parallel programming, none has yet been demonstrated to be effective. This thesis tests the effectiveness of one particular approach for MIMD computers, consisting of the CTA machine model, the Phase Abstractions programming model, and the Orca C programming language. The CTA defines the algorithm designer's view of a machine, the Phase Abstractions provide an abstraction of the CTA that guides the construction of programs, and Orca C provides a language for expressing the programmer's solutions.

We begin by presenting an experimental study that compares the performance of shared memory and non-shared memory programs on shared memory machines. These results show that the non-shared memory programming model, which is based on the CTA machine model, has wide applicability. We then study a large scientific benchmark, showing that a Phase Abstractions implementation of SIMPLE is portable across a variety of MIMD computers. We also use this example to describe the Orca C language and show how it leads to flexible programs that can adapt to different hardware environments through parameterization. Because of the huge diversity of parallel architectures, this flexibility is crucial to supporting portability. Finally, we consider two "real" applications, matrix multiplication and the Modified Gram-Schmidt method of solving QR factorization. Because these studies involve algorithm design, they allow us to indirectly observe the role that the CTA machine model plays in achieving portability. We develop machine-independent algorithms for these two applications and present performance results for various parallel computers.

TR # 92-12-05
``Relaxed Consistency and Synchronization in Parallel Processors''
Rick Zucker

Parallel programs often do not obtain close to linear speed-up when compared to a sequential version of the program running on a uniprocessor. There are many reasons that linear speed-up is not obtained. Two important ones are the overhead of synchronization and memory latency. Synchronization, the coordination of the work done by different processors, is an overhead that does not exist in uniprocessor programs. Therefore, excessive time spent performing synchronization leads to a loss of performance.
Many previous studies to evaluate this overhead have used artificial benchmarks with high levels of lock contention. In this dissertation I study both the effects of synchronization on the performance of real parallel programs and the impact of the efficiency of the implementation of the synchronization algorithm. The results show that the frequency of synchronization is the most significant factor leading to performance loss. When synchronization occurs sufficiently often, the implementation algorithm has a non-negligible effect.

Memory latency, the length of time from when a request to memory is initiated until it completes, is a major problem in multiprocessors. Many hardware and software enhancements have been proposed to deal with the problem. One of the ideas is relaxed models of memory consistency. Relaxed models, such as weak ordering or release consistency, replace sequential consistency, the usual intuitive model of how the memory of the system is implemented. With this change in the memory model, many architectural features can now be used that are not allowed under sequential consistency, but at the cost of imposing constraints on the programmer of parallel systems. In this dissertation I consider many of these architectural features, such as bypassing, lockup-free caches, and a software-controlled cache coherence scheme that I propose. I attempt to determine the performance benefits of using such features and which features provide the most benefit. The results show that relaxed consistency can provide significant performance gains for some programs and architectures. The choice of a given relaxed model does not significantly affect the gains. Software-controlled cache coherence, a scheme that requires a smaller hardware investment, can provide equivalent performance in some cases and competitive performance in others.

TR # 92-12-06
``A Direct Version of Shamir and Snir's Lower Bound''
Prasoon Tiwari and Martin Tompa

We present direct proofs of the following results of Shamir and Snir [Mathematical System Theory, 13, 301-322, 1980] on the depth of monotone arithmetic circuits over rings of characteristic 0: (1) an Omega((log p)(log n)) lower bound for computing the product of p n x n matrices; and (2) an Omega(n) lower bound for computing the permanent of an n x n matrix.

TR # 92-12-07
``Reordering Iterations in Runtime Loop Parallelization''
Shun-Tak Leung and John Zahorjan

When a loop in a sequential program is parallelized, it is normally guaranteed that all flow dependencies and anti-dependencies are respected so that the result of parallel execution is always the same as sequential execution. In some cases, however, the algorithm implemented by the loop allows the iterations to be executed in a different sequential order than the one specified in the program. This opportunity can be exploited to expose parallelism that exists in the algorithm but is obscured by its sequential program implementation. In this paper, we show how parallelization of this kind of loop can be integrated into the runtime parallelization scheme of Saltz et al. [17, 18].
Runtime parallelization is a general technique appropriate for loops whose dependency structures cannot be determined at compile time. The compiler generates two pieces of code: the inspector examines dependencies at run time and computes a parallel schedule; the executor executes iterations in parallel according to the computed schedule. In our case, the inspector has to solve two problems: choosing an appropriate sequential order for the iterations and computing a parallel schedule. The two problems are treated as a single graph coloring problem, which is solved heuristically. Two methods to do so are described. Furthermore, the basic runtime parallelization scheme for shared-memory multiprocessors pays no attention to locality when making these decisions. The performance implications of reordering are examined experimentally on a KSR1 parallel machine as well as through a simple analytic model of execution time.

TR # 92-12-08
``Theory and Practice of Vector Quantizers Trained on Small Training Sets''
David Cohn, Eve A. Riskin, Richard Ladner

We examine how the performance of a memoryless vector quantizer changes as a function of its training set size. Specifically, we study how well the training set distortion predicts test distortion when the training set is a randomly drawn subset of blocks from the test or training image(s). Using the Vapnik-Chervonenkis dimension, we derive formal bounds for the difference of test and training distortion of vector quantizer codebooks. We then describe extensive empirical simulations that test these bounds for a variety of bit rates and vector dimensions, and give practical suggestions for determining the training set size necessary to achieve good generalization from a codebook. We conclude that, by using training sets consisting of only a small fraction of the available data, one can produce results that are close to the results obtainable when all available data are used.
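
The quantity under study can be made concrete with a small sketch (invented synthetic data, not the image experiments in the report): design a codebook on a random subset of vectors and compare its distortion on that subset with its distortion over all available vectors.

    # Sketch of the measurement in question (invented data, not the report's
    # experiments): design a codebook on a small random subset with a few
    # generalized-Lloyd iterations, then compare training and overall distortion.

    import random

    def dist2(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y))

    def nearest(codebook, x):
        return min(codebook, key=lambda c: dist2(c, x))

    def train_codebook(train, k, iterations=10):
        codebook = random.sample(train, k)
        for _ in range(iterations):
            cells = {i: [] for i in range(k)}
            for x in train:
                i = min(range(k), key=lambda j: dist2(codebook[j], x))
                cells[i].append(x)
            for i, members in cells.items():
                if members:                     # move codeword to cell centroid
                    codebook[i] = [sum(col) / len(members) for col in zip(*members)]
        return codebook

    def distortion(codebook, data):
        return sum(dist2(nearest(codebook, x), x) for x in data) / len(data)

    random.seed(0)
    data = [[random.gauss(0, 1) for _ in range(4)] for _ in range(2000)]
    train = random.sample(data, 100)            # small training set
    cb = train_codebook(train, k=8)
    print("train distortion:", distortion(cb, train))
    print("test  distortion:", distortion(cb, data))   # typically noticeably higher
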