DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIVERSITY OF WASHINGTON TECHNICAL REPORT ABSTRACTS - 94-02-01 -> 94-05-07 Obtaining Electronic Technical Reports: Technical reports in this list with an asterisk (*) following the TR number are available on-line as of April 1995 via FTP (ftp.cs.washington.edu), or the WWW (http://www.cs.washington.edu/research/tr/techreports.html) Obtaining Hardcopy Technical Reports: To obtain a technical report on paper, sent to you via Snail Mail, please contact . If you need additional information or assistance, please contact: ABSTRACTS TR # 94-02-01 * "Restructuring Arrays for Efficient Parallel Loop Execution" Shun-Tak Leung and John Zahorjan In a sequential program, data are often structured in a way that is optimized for a sequential execution. However, when the program is parallelized, the data access pattern may change drastically. If the structure of the data is not changed accordingly, parallel performance will suffer. In this paper, we consider this problem in the context of runtime loop parallelization which is a general technique to parallelize loops not amenable to compile-time analysis. In a parallel execution of a loop, iteration may be performed in a very different order than in the sequential execution. This may result in undesirable cache effects on distributed shared-memory multiprocessors, unless the structure of the arrays accessed by these iterations is changed accordingly. We discuss what these problems are and how they arise. We then describe two data restructuring techniques to address them: the restructuring of read-write arrays to reduce inter-processor communication due to false sharing, and the restructuring of read-only arrays to improve spatial locality. We also report experiments on a KSR1 to evaluate the effectiveness of these techniques and the preprocessing and postprocessing overheads they entail. The results show that the restructuring techniques can substantially improve performance of the parallelized loop. When restructuring overheads are ignored, we see a doubling of parallel speedups. While restructuring overheads can be quite significant, they can often be amortized across multiple loop executions so that they do not outweigh the performance benefits. In our experiments, it only takes two loop executions to achieve this. TR # 94-02-02 * "An Algorithm for Exact Bounds on the Time Separation of Events in Concurrent Systems" Henrik Hulgaard, Steven M. Burns, Tod Amon, Gaetano Borriello Determining the time separation of events is a fundamental problem in the analysis, synthesis, and optimization of concurrent systems. Applications range from logic optimization of asynchronous digital circuits to evaluation of execution times of programs for real-time systems. We present an efficient algorithm to find exact (tight) bounds on the separation time of events in an arbitrary process graph without conditional behavior. This result is more general than the methods presented in several previously published papers as it handles cyclic graphs and yields the tightest possible bounds on event separations. The algorithm is based on a functional decomposition technique that permits the implicit evaluation of an infinitely unfolded process graph. Examples are presented that demonstrate the utility and efficiency of the solution. The algorithm will form a basis for exploration of timing-constrained synthesis techniques. TR # 94-02-03 * "On Scalable State-Based Specifications for Real-Time Systems" Alan Shaw Using our communicating real-time state machine (CRSM) language as a basis, we propose and develop a methodology for specifying requirements and designs for large real-time systems. CRSMs are distributed state machines with novel and general timing facilities and CSP-like synchronous communications. The paper first presents a particular controller-client (CC) architecture for composing CRSMs into larger components and then uses the CC organization to define a number of standard in-the-large paradigms for real-time and other software. TR # 94-02-04 * "The Case for Chaotic Adaptive Routing" Kevin Bolding, Melanie L. Fulgham, Lawrence Snyder Chaotic routers are randomizing, non-minimal adaptive packet routers designed for use in the communication networks of parallel computers. Chaotic routing is reviewed along with other contemporary network routing approaches, including the state-of-the-art oblivious routers. Each routing approach is evaluated for its effectiveness as a multicomputer message router. The results indicate that the Chaos router is the most effective of known routing methods. TR # 94-02-05 * "Identifying Profitable Specialization in Object-Oriented Languages" Jeffrey Dean, Craig Chambers, and David Grove The performance of object-oriented languages can be greatly improved if methods can be specialized for particular classes of arguments. Such specialization can provide the compiler with enough class information about the receivers of messages within the specialized routine to enable these messages to be statically-bound to their target methods and subsequently inlined. We present an algorithm for automatically determining which methods are most profitable to specialize for which argument classes. This algorithm improves on previous automatic techniques by avoiding the twin problems of over- and underspecialization and by being suitable for specializing programs that use multi-methods. TR # 94-03-01 "Typechecking and Modules for Multi-Methods" Craig Chambers and Gary T. Leavens Two major obstacles preventing the wider acceptance of multi-methods are concerns over the lack of encapsulation and modularity and the lack of static typechecking in existing multi-method-based languages. This paper addresses both of these problems. We present a polynomial-time static typechecking algorithm that checks conformance, completeness, and consistency of a group of method implementations with respect to declared message signatures. This algorithm improves on previous algorithms by handling separate type and inheritance hierarchies, the presence of abstract classes, and graph-based method look-up semantics. We prove formally that our algorithm fulfills its specification. We also present a module system that enables independently-developed code to be fully encapsulated and statically typechecked on a per-module basis. To guarantee that potential conflicts between independently-developed modules have been resolved, a simple well-formedness condition on the molecules comprising a program is checked at link-time. The typechecking algorithm and module system are applicable to a range of multi-method-based languages, but the paper uses the Cecil language as a concrete example of how they can be applied. TR # 94-03-02 "A New Data Structure for Fast Approximate Matching" Andrew P. Berman Given a set of objects S and a metric D, we describe how to represent S as a new data structure, the "triangulation trie". This data structure can be used to search through S quickly to find approximate matches to a given object. Using the triangle inequality, the search tree is repeatedly pruned to reduce the number of object comparisons required. Much of the work is done within the tree using integer comparisons. This method can result in very fast database searches in applications where object comparisons are traditionally costly. Furthermore, the data structure seems to be applicable to a very wide variety of object types. The trie is unusual in its construction in that objects are partitioned according to their respective distances from a common set of "key" objects. TR # 94-03-03 * "SPIN - An Extensible Microkernel for Application-specific Operating System Services" Brian Bershad, Craig Chambers, Susan Eggers, Chris Maeda, Dylan McNamee, Przemyslaw Pardyak, Stefan Savage, Emin Gun Sirer Application domains, such as multimedia, databases, and parallel computing, require operating system services with high performance and high functionality. Existing operating systems provide fixed interfaces and implementations to system services and resources. This makes them inappropriate for applications whose resource demands and usage patterns are poorly matched by the services provided. The SPIN operating system enables system services to be defined in an application-specific fashion through an extensible microkernel. It offers fine-grained control over a machine's logical and physical resources to applications through run-time adaptation of the system to application requirements. TR # 94-03-04 * "Interface Timing Verification with Combined MAX and LINEAR Constraints" Elizabeth A. Walkup, Gaetano Borriello A fundamental timing analysis problem in the verification and synthesis of interface logic circuitry is the determination of the possible and allowable time separations, or "skews" between interface events, given timing constraints and propagation delays between the events generated by the circuits the interface connects. These skews are used to verify timing properties and determine allowable propagation delays for logic synthesis. The main contributions of this report are two-old. First, this report shows that the verification problem can be expressed with constraints of the form x_i <= Max{x_{j1} + Delta_{j1,i}, ..., x_{jm} + Delta_{jm,i}}, such as those described in several other domains including the {Max, +} algebra used in modeling discrete event systems. Second, this report presents and proves correct an algorithm that provides tight upper bounds on the time separation between all pairs x_i, x_j for such constraint set in less time and with tighter bounds than previous algorithms. TR # 94-03-05 * "Measurement and Application of Dynamic Receiver Class Distributions" Charles D. Garrett, Jeffrey Dean, David Grove, Craig Chambers Dynamic binding slows down object-oriented programs. Dynamic dispatch mechanisms which work well where all receiver classes are equally likely are too pessimistic because at most call sites one receiver predominates. We apply dynamic profile information to determine the dynamic execution frequency distributions of the classes of receivers at call sites. We show that these distributions are heavily skewed towards the most commonly occurring receiver class across several different languages. Moreover, we show that the distributions are stable across program inputs, from one version of a program to another, and even to some extent across programs that share library code. Finally, we demonstrate that significant run-time performance improvements for object-oriented programs can be gained by exploiting the information contained in dynamic receiver class distributions in a relatively simple optimizing compiler. TR # 94-03-06 * "Testing Asynchronous Circuits: A Survey" Henrik Hulgaard, Steve Burns, Gaetano Borriello Asynchronous circuit design has been studied for decades, but it has only recently been feasible to construct large and efficient asynchronous systems. This paper surveys different techniques for checking whether an asynchronous circuit has fabrication defects. The inherent differences between asynchronous and synchronous circuits, primarily that asynchronous circuits do not have a global clock, necessitate a review of the testing techniques used for synchronous circuits and a re-evaluation of the trade-offs involved. New methods that efficiently utilize the structure of asynchronous circuits are possible, most notably self-checking asynchronous circuits can be implemented with little or no circuit overhead. 94-03-07 * "Constraints and Object Identity" Gus Lopez, Bjorn Freeman-Benson, and Alan Borning Constraint imperative programming is an integration of declarative constraints and imperative object-oriented programming. The primary goal of this integration is to use constraints to express relations among objects explicitly--relations that were implicit in the code in previous languages. However, one of the fundamental concepts of object-oriented programming, object identity, can result in implicit relations, even when explicit identity constraints are supported. We analyze the problem and propose a solution--identity constraints--which we have implemented in our Kaleidoscope'93 language. This solution is understandable, efficiently implementable, and compatible with the Kaleidoscope constraint model. TR # 94-04-01 * "Pin Assignment for Multi-FPGA Systems" Scott Hauck and Gaetano Borriello There is currently great interest in using systems of FPGAs for logic emulators, custom computing devices, and software accelerators. An important step in making these technologies more generally useful is to develop completely automatic mapping tools from high-level specification to FPGA programming files. In this paper we examine one step in this automatic mapping process, the selection of physical FPGA pins to use for routing inter-FPGA signals. We present an algorithm that greatly increases mapping speed while also improving mapping quality. TR # 94-04-02 * "A Group Structuring Mechanism for a Distributed Object-oriented Language" Przemyslaw Pardyak and Brian N. Bershad This paper describes a structuring mechanism for grouping objects in a distributed object-oriented language. A group structuring mechanism provides a single flexible method for managing distributed applications that involve complicated communication protocols and sophisticated structure. We have added such a mechanism to the Emerald distributed object-oriented language and its runtime system. Our group structuring mechanism fits entirely within the context of object-oriented programming, so similar mechanisms could be added to other distributed object-oriented languages. TR # 94-05-01 "An Architecture-Adaptive, Performance-Driven Router for FPGAs" Larry McMurchie, Carl Ebeling, and Gaetano Borriello We present an architecture-adaptive, performance-driven router for FPGAs. In this router, delay minimization and congestion elimination are balanced on an iteration-by-iteration basis. The signal ordering dependency that is characteristic of most obstacle avoidance routers is reduced by allowing signals to share nodes and adjusting the cost of those nodes in a semi-equilibrium fashion. Because it requires only a directed graph to describe the architecture of routing resources, the router adapts readily to a wide variety of FPGAs architectures, such as Triptych, Xilinx 3000 and mesh connected arrays of FPGAs. The results of routing ISCAS benchmarks on the Triptych FPGA architecture show an average increase of only 4.5% in critical path delay from the optimum delay for the placement. Routes of ISCAS benchmarks on the Xilinx 3000 architecture show a greater completion rate than commercial tools as well as 11% shorter critical paths. TR # 94-05-02 * "Design and Evaluation of a Subblock Cache Coherence Protocol for Bus-Based Multiprocessors" Craig Anderson and Jean-Loup Baer Parallel applications exhibit a wide variety of memory reference patterns. Designing a memory architecture that serves all applications well is not easy. However, because tolerating or reducing memory latency is a priority in effective parallel processing, it is important to explore new techniques to reduce memory traffic. In this paper, we describe a snoopy cache coherence protocol that uses a large sized transfer block (to take advantage of spatial locality) while using a small coherence block in order to avoid false sharing. To further illustrate the protocol, we present an example of its workings. We then present the results of simulating our protocol on 5 applications that exhibit a variety of reference patterns. We find that our protocol effectively takes advantage of spatial locality while avoiding the increase in false sharing that often occurs when using large line sizes. TR # 94-05-03 "Self-Tuned Clocks and Crystal Clocks" Ted Kehl, Steve Burns, Chris Fisher Self-tuning measures the maximum clocking rate of a circuit and then adjusts the clock to run at nearly this rate. Self-tuning can dramatically improve performance, for example, experiments showed that self-tuned DRAMs can operate at about half the read access time of those with conventional control. In this paper we present experiments showing why a self-tuned clock -- an asynchronous stop/start clock -- is preferable to a crystal clock in that the self-tuned clock can operate closer to the maximum clocking speed of a circuit than can a a crystal while guaranteeing glitch-free synchronization. The first part of this paper describes the self-tuned clock and its glitch-free synchronization. The second part describes experiments supporting our contention that the self-tuned clock is better able to operate close to the maximum speed of the circuitry than is a crystal oscillator. The third part shows that this new clock is more efficient than a conventional synchronizer. The fourth part concerns arbitration for this clock from many sources. TR # 94-05-04 "Using Assertions for Validating, Verifying and Monitoring Real-Time Systems" Sitaram C. V. Raju Real-time systems are different from traditional computer systems because they have timing constraints that must always be met. Designing correct real-time systems is not an easy task because it is simply not possible to exhaustively test any large real-time system. In this dissertation we attempt to address this problem by using assertions for validating, verifying and monitoring real-time systems. We first present a prototyping environment for an executable specification scheme called Communicating Real-Time State Machines (CRSMs). The system behavior of CRSMs is characterized by a time-stamped trace of communication events. We present a novel assertion language for specifying safety and timing assertions on the trace of communication events. The prototyping environment consists of a graphical editor for describing CRSMs, a CRSM simulator to observe the execution of the prototype, and an assertion checker for monitoring assertions during simulation. We then describe an automatic verification technique for real-time systems modeled by an expressive subclass of CRSMs. The proposed approach is to represent the behavior of the system by a finite, timed reachability graph. We provide a decision procedure for verifying timing and safety properties (specified in our assertion language) of the reachability graph. We describe an implementation of the verification technique and demonstrate its usefulness on example real-time specifications. Even if a real-time system has been formally specified, verified and then implemented in software it is important that the implementation check for constraint violations states at run-time. The reason is that unexpected conditions at run-time must not cause the real-time system to fail. We describe a run-time environment for monitoring timing constraints in distributed real-time systems. Timing constraints are specified in a subset of our assertion language. Constraint violations are detected at the earliest possible time by deriving and checking intermediate constraints from the user-specified constraints. The major contributions of this dissertation are: a method for checking CRSM simulations, an automatic verification technique for CRSMs, and the development of a run-time monitor for real-time applications. Future work includes the development of a scalable verification method and experimenting with large real-time applications. TR # 94-05-05 "A Performance Evaluation of Lock-Free Synchronization Protocols" Anthony LaMarca In this paper, we investigate the practical performance of lock-free techniques that provide synchronization on shared-memory multiprocessors. Our goal is to provide a technique to allow designers of new protocols to quickly determine an algorithm's performance characteristics. We develop a simple analytical performance model based on the architectural observations that memory accesses are expensive, synchronization instructions are more expensive, and that optimistic synchronization policies result in wasted communication bandwidth which can slow the system as a whole. Using our model, we evaluate the performance of five existing lock-free synchronization protocols. We validate our analysis by comparing our results with simulations of a parallel machine. Given this analysis, we identify those protocols which show promise of good performance in practice. In addition, we note that no existing protocols provide insensitivity to common delays while still offering performance equivalent to locks. Accordingly, we introduce a protocol, based on a combination of existing lock-free techniques which satisfies these criteria. TR # 94-05-06 "The Design of an Interactive Visualization Tool for Exploring Image Convolution" Alex Rothenberg Much of traditional mathematics education has revolved around the teaching of formal terminology leaving students unaware of its context or relevance. This paper argues that visualization can be used as a technique to expose the concepts behind the terminology. The paper focuses on how visualization can be applied to image convolution, then describes "the convolver", a tool applying these ideas implemented under Microsoft Windows. Using high level graphical interfaces, it is possible for students to directly experience the concepts bypassing much of the formal terminology. Finally, possible uses of the convolver are presented within both a university level image processing course and a K-12 mathematics curriculum. TR # 94-05-07 * "A Self-Accelerating Packet Service Discipline for Low-Delay Service to Bursty Flows" Ricardo Pincheira In this paper, we consider the transmission of compressed video over networks. We develop a packet service discipline called Axel that provides very low service delays to bursty flows while providing guaranteed throughput to all flows in the system. In contrast to existing policies capable of providing bandwidth guarantees, Axel can provide low delay service to bursty flows without the need for resource over-reservation. We study the behavior of Axel through analysis and simulation, and conclude that Axel meets its stated goals. The policy is targeted at a network environment composed of a mixture of bursty, delay sensitive traffic and smoother, delay insensitive traffic. We expect such an environment to arise from a mixture of interactive, compressed video streams (eg., tele-conferencing) and non-interactive compressed video streams (eg., video-on-demand).