DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIVERSITY OF WASHINGTON
TECHNICAL REPORT ABSTRACTS - FROM 93-01-01 TO 93-05-08

TR # 93-01-01
"Mesh Optimization"
Hugues Hoppe, Tony DeRose, Tom Duchamp, John McDonald and Werner Stuetzle

We present a method for solving the following problem: Given a set of data points scattered in three dimensions and an initial triangular mesh M_0, produce a mesh M, of the same topological type as M_0, that fits the data well and has a small number of vertices. Our approach is to minimize an energy function that explicitly models the competing desires of conciseness of representation and fidelity to the data. We show that mesh optimization can be effectively used in at least two applications: surface reconstruction from unorganized points, and mesh simplification (the reduction of the number of vertices in an initially dense mesh of triangles).

TR # 93-01-02
"Hierarchical Constraint Logic Programming"
Molly Wilson and Alan Borning

Constraint Logic Programming (CLP) is a general scheme for extending logic programming to include constraints. It is parameterized by D, the domain of the constraints. However, CLP(D) languages, as well as most other constraint systems, only allow the programmer to specify constraints that must hold. In many applications, such as interactive graphics, planning, document formatting, and decision support, one needs to express preferences as well as strict requirements. If we wish to make full use of the constraint paradigm, we need ways to represent these defaults and preferences declaratively, as constraints, rather than encoding them in the procedural parts of the language. We describe a scheme for extending CLP(D) to include both required and preferential constraints. An arbitrary number of strengths of preference are allowed. We present a theory of such constraint hierarchies, and an extension, Hierarchical Constraint Logic Programming, of the CLP scheme to include constraint hierarchies. We give an operational, model-theoretic, and fixed-point semantics for the HCLP scheme. Finally, we describe two interpreters we have written for instances of the HCLP scheme, give example programs, and discuss related work.

TR # 93-02-01
"Time-Space Tradeoffs for Undirected Graph Traversal"
Paul Beame, Allan Borodin, Prabhakar Raghavan, Walter L. Ruzzo, and Martin Tompa

We prove time-space tradeoffs for traversing undirected graphs, using a variety of structured models that are all variants of Cook and Rackoff's "Jumping Automata for Graphs". Our strongest tradeoff is a quadratic lower bound on the product of time and space for graph traversal. For example, achieving linear time requires linear space, implying that depth-first search is optimal. Since our bound in fact applies to nondeterministic algorithms for nonconnectivity, it also implies that closure under complementation of nondeterministic space-bounded complexity classes is achieved only at the expense of increased time. To demonstrate that these structured models are realistic, we also investigate their power. In addition to admitting well known algorithms such as depth-first search and random walk, we show that one simple variant of this model is nearly as powerful as a Turing machine. Specifically, for general undirected graph problems, it can simulate a Turing machine with only a constant factor increase in space and a polynomial factor increase in time.
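For orientation, the quadratic tradeoff claimed in the abstract above can be restated as the display below, where N measures the size of the input graph (whether vertices or edges is not stated in the abstract, so this reading is an assumption); the second line is the special case quoted above.

    S * T = Omega(N^2)                  (product of space and time for traversal)
    T = O(N)  implies  S = Omega(N)     (linear time forces linear space)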
TR # 93-02-02
"Building FIFO and Priority-Queuing Spin Locks from Atomic Swap"
Travis S. Craig

We present practical new algorithms for FIFO or priority-ordered spin locks on shared-memory multiprocessors with an atomic swap instruction. Different versions of these queuing spin locks are designed for machines with coherent-cache and NUMA memory models. We include extensions to provide nested lock acquisition, conditional locking, timeout of lock requests, and preemption of waiters. These locks apply to both real-time and non-real-time parallel systems, and we include a comparison of the traits of several lock schemes aimed at those environments. Our main technical contributions are our techniques and algorithms that provide tight control over lock grant order, use only the atomic swap instruction, use at most one (local-only) spin for lock acquisition and no spinning for lock release, and need only O(L + P) space on either a coherent-cache or NUMA machine.

TR # 93-02-04
"A Note on the Least Common Multiples of Dense Sets of Integers"
Wendy Thrash

No abstract for this report.

TR # 93-02-05
"Simulation of Multiprefix PRAMs by Unbounded Fan-in Circuits"
Paul Beame and Rakesh Sinha

We show that PRAMs that have multiprefix (scan) operations available as primitives may be simulated by unbounded fan-in circuits with gates for related operations. The complexity bounds depend on the word-size of the PRAM but not on the multiprefix operations or the instruction set of the processors.

TR # 93-03-01
"Implementing Network Protocols at User Level"
Chandramohan A. Thekkath, Thu D. Nguyen, Evelyn Moy and Edward D. Lazowska

Traditionally, network transport protocols have been implemented in the kernel due to performance and security concerns. However, considerations of code maintenance, ease of debugging, customization, and the simultaneous existence of multiple protocols argue for separating the implementation into more manageable user-level libraries. Our approach does not sacrifice performance, protection, or functionality compared to monolithic implementations. We begin by motivating the need for user-level protocol implementations and placing our approach in the context of previous work. We then describe our system, which has been implemented on Mach workstations connected not only to traditional Ethernet, but also to a more modern network (the DEC SRC AN1). Based on our experience, we discuss the implications for host-network controller design and for overall system structure to support efficient user-level implementations of communication protocols.

TR # 93-03-02
"A Performance Study of a New Grid Protocol and General Grid Structures for Replicated Data"
Akhil Kumar, Michael Rabinovich and Rakesh Sinha

Recently there has been considerable interest in the study of replica control protocols which are based on organizing several copies of an object into logical structures, such as rectangular grids. In addition to high availability, another objective in exploiting such structures is to improve the degree of load sharing in a system. In this paper, we extend the scope of grid structures to general grids, which allow holes in various positions of a rectangular structure and are useful to consider because they often produce availabilities that are higher than those of solid grids, where every position must be occupied by a node. In addition to proposing an improvement to the existing grid protocol, we also offer new insights into the performance of the grids, from both availability and load sharing points of view. Algorithms for designing grids to maximize availability, both independently and in conjunction with a load-sharing constraint, are given.
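For orientation only, the sketch below checks the classical quorum rule for a solid rectangular grid of copies: one available copy per column forms a read quorum, and a read quorum plus one fully available column forms a write quorum. It is a hedged illustration of the structures this line of work starts from, not of the improved protocol or of the general grids with holes studied in the report above; the array layout and function names are assumptions made for the sketch.

    /* Hedged sketch of the classical solid-grid quorum rule, for orientation only;
     * the report above studies an improved protocol and more general grids with holes.
     * up[r][c] is nonzero if the copy at row r, column c is currently available
     * (the layout and names are assumptions made for this sketch). */
    #include <stdbool.h>
    #include <stdio.h>

    #define ROWS 3
    #define COLS 4

    /* Read quorum: at least one available copy in every column (a "column cover"). */
    bool has_read_quorum(int up[ROWS][COLS]) {
        for (int c = 0; c < COLS; c++) {
            bool covered = false;
            for (int r = 0; r < ROWS; r++)
                if (up[r][c]) { covered = true; break; }
            if (!covered) return false;
        }
        return true;
    }

    /* Write quorum: a column cover plus every copy of some single column. */
    bool has_write_quorum(int up[ROWS][COLS]) {
        if (!has_read_quorum(up)) return false;
        for (int c = 0; c < COLS; c++) {
            bool whole_column = true;
            for (int r = 0; r < ROWS; r++)
                if (!up[r][c]) { whole_column = false; break; }
            if (whole_column) return true;
        }
        return false;
    }

    int main(void) {
        int up[ROWS][COLS] = {      /* column 0 is fully available */
            {1, 0, 1, 1},
            {1, 1, 0, 1},
            {1, 0, 1, 0},
        };
        printf("read quorum:  %s\n", has_read_quorum(up) ? "yes" : "no");   /* yes */
        printf("write quorum: %s\n", has_write_quorum(up) ? "yes" : "no");  /* yes */
        return 0;
    }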
TR # 93-03-03
"Latency Analysis of TCP on an ATM Network"
Alec Wolman, Geoff Voelker, and Chandramohan A. Thekkath

In this paper, we characterize the latency of TCP on an ATM network. Latency reduction is a difficult task, and careful analysis is the first step towards reduction. We investigate the impact of both the network controller and the protocol implementation on latency. We find that a low latency network controller has a significant impact on the overall latency even for a reliable transport protocol such as TCP, and that replacing the ULTRIX TCP implementation with the BSD 4.4 alpha implementation improves the latency by up to 20%. We also characterize the impact on latency of some widely discussed improvements to TCP, such as header prediction and combining the checksum calculation with data copying.

TR # 93-03-04
"Communication Complexity: Iterative Techniques for Lower Bounds"
Joan T. Lawry

We add to the repertoire of lower bound techniques for the two-party communication complexity of Boolean functions by adapting the iterative lower bound technique previously used for search problems to decision problems. Using this approach, we show the existence of a Boolean function whose deterministic complexity is the square of its nondeterministic complexity, with epsilon-error randomized complexity no better than deterministic complexity, to within a constant factor. However, our input domain is only a portion of the set that would usually be associated with our function in the two-party communication model. Nonetheless, our result suggests that, for the class of functions with maximal separation between deterministic and nondeterministic complexity, the decision problem is nontrivial in that a randomized protocol for a decision problem has a significantly lower tolerance for error than one for a search problem.

TR # 93-03-05
"The Cecil Language - Specifications and Rationale"
Craig Chambers

Cecil is a new purely object-oriented language intended to support rapid construction of high-quality, extensible software. Cecil combines multi-methods with a classless object model, object-based encapsulation, and optional static type checking. Cecil's static type system distinguishes between subtyping and code inheritance, but Cecil enables these two graphs to be described with a single set of declarations, optimizing the common case where the two graphs are parallel. Cecil includes a fairly flexible form of parameterization, including both explicitly parameterized objects, types, and methods and implicitly parameterized methods related to the polymorphic functions commonly found in functional languages. By making type declarations optional, Cecil aims to support mixed exploratory and production programming styles. This document describes the design of the Cecil language as of March, 1993. It mixes the specification of the language with discussion of design issues and explanations of the reasoning that led to various design decisions.

TR # 93-03-06
"Pointers versus Arithmetic in PRAMs"
Patrick W. Dymond, Faith E. Fich, Naomi Nishimura, Prabhakar Ragde, Walter L. Ruzzo

Manipulation of pointers in shared data structures is an important communication mechanism used in many parallel algorithms. Indeed, many fundamental algorithms do essentially nothing else. A Parallel Pointer Machine (or PPM) is a parallel model having pointers as its principal data type.
PPMs have been characterized as PRAMs obeying two restrictions --- first, restricted arithmetic capabilities, and second, the CROW memory access restriction (Concurrent Read, Owner Write, a commonly occurring special case of CREW). We present results concerning the relative power of PPMs (and other arithmetically restricted PRAMs) versus CROW PRAMs having ordinary arithmetic capabilities. First, we prove lower bounds separating PPMs from CROW PRAMs. For example, any step-by-step simulation of an n-processor CROW PRAM by a PPM requires time Omega(log log n) per step. Second, we show that this lower bound is tight --- we give such a step-by-step simulation using O(log log n) time per step. As a corollary, we obtain sharply improved PPM algorithms for a variety of problems, including deterministic context-free language recognition.

TR # 93-03-07
"How Reductions to Sparse Sets Collapse the Polynomial-time Hierarchy: A Primer"
Paul Young

This expository paper is based on lectures given while the author was a Visiting Professor in the Dipartimento di Scienze dell'Informazione, Universita degli Studi di Milano, in November and December, 1991. The author is grateful for the support given by the Consiglio Nazionale delle Ricerche. Presented also at the 4th European Summer School in Logic, Language and Information, Essex, England, August 1992. This revision of September, 1992, corrects a number of inaccurate statements involving Kadin's theorem which appeared in an earlier version of this paper appearing in the Summer, 1992, issue of SIGACT News.

TR # 93-04-01
"Adhara: Runtime Support for Dynamic Space-Based Applications on Distributed Memory MIMD Multiprocessors"
Immaneni Ashok and John Zahorjan

Dynamic space-based applications are simulations of objects moving through a closed k-dimensional space subject to mutual forces. There are a wide variety of such applications, differing in the kinds of objects and forces being simulated. Dynamic space-based applications exhibit strong data locality patterns, but these patterns change during the computation as objects change position in the simulated space. To achieve good performance when run on distributed memory multiprocessors, two conflicting goals must be addressed: the strong spatial data locality must be exploited, and the computational load must be balanced across the processors. These optimizations can be done statically, by either the programmer or the compiler, or dynamically, by handwritten application code or a runtime system. Relying on the programmer for these functions imposes an unreasonable burden on her, as it is a hard and time-consuming process to develop the code required. At the same time, the compiler cannot be expected to automatically generate code leading to good performance, since it cannot extract the space-based data dependencies from the program source. Thus, the most appropriate approach to providing the needed functions is through a specialized runtime system that can be used with any dynamic, space-based application. In this paper we describe the design and the programming interface of ADHARA, a runtime system tailored to the needs of dynamic space-based applications. Using a specific plasma physics application as an example, we show that ADHARA enables the development of efficient parallel codes that involve very few additional lines of code and very few additional concepts beyond those required to construct a sequential version of the application.
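As a hedged sketch of the kind of spatial bookkeeping a runtime such as ADHARA must perform (the report does not give its interface, and all structures and names below are illustrative assumptions), objects can be binned into grid cells by position so that interacting neighbors fall into the same or adjacent cells, and whole cells can be reassigned among processors to rebalance load as objects move.

    /* Minimal sketch of spatial binning for a space-based simulation runtime; the
     * structures and names are illustrative assumptions, not ADHARA's interface. */
    #include <math.h>
    #include <stdio.h>

    typedef struct { double x, y; } Position;

    typedef struct {
        int    cells_per_side;   /* grid resolution */
        double cell_size;        /* edge length of one cell */
    } SpaceGrid;

    /* Map a position to a cell index; nearby objects map to the same or adjacent cells. */
    int cell_of(const SpaceGrid *g, Position p) {
        int cx = (int)floor(p.x / g->cell_size);
        int cy = (int)floor(p.y / g->cell_size);
        if (cx < 0) cx = 0;
        if (cx >= g->cells_per_side) cx = g->cells_per_side - 1;
        if (cy < 0) cy = 0;
        if (cy >= g->cells_per_side) cy = g->cells_per_side - 1;
        return cy * g->cells_per_side + cx;
    }

    int main(void) {
        SpaceGrid g = { 8, 1.0 };              /* an 8 x 8 grid of unit cells */
        Position  p = { 3.7, 5.2 };
        /* A load balancer would keep a cell -> processor map and rewrite it (migrating
         * the affected objects) when per-processor populations drift apart. */
        printf("object at (%.1f, %.1f) is binned into cell %d\n", p.x, p.y, cell_of(&g, p));
        return 0;
    }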
TR # 93-04-02 "Sharing and Protection in a Single Address Space Operating System" Jeffrey S. Chase, Henry M. Levy, Michael J. Feeley, and Edward D. Lazowska The appearance of 64-bit address space architectures, such as the DEC Alpha, HP PA-RISC, and MIPS R4000, signals a radical shift in the amount of address space available to operating systems and applications. This shift provides the opportunity to reexamine fundamental operating system structure; specifically, to change the way that operating systems use address space. This paper explores memory sharing and protection support in Opal, a single address space operating system designed for wide-address architectures. Opal threads execute within protection domains in a single shared virtual address space. Sharing is simplified, because addresses are context independent. There is no loss of protection, because addressability and access are orthogonal; the right to access a segment is determined by the protection domain in which a thread executes. This model enables beneficial code and data sharing patterns that are currently prohibitive, due in part to the inherent restrictions of multiple address spaces, and in part to Unix programming style. We have designed and implemented an Opal prototype using the Mach 3.0 microkernel as a base. Our implementation demonstrates how a single address space structure can be supported alongside of other environments on a modern microkernel operating system, using modern wide-address architectures. This paper justifies the Opal model and its goals for sharing and protection, presents the system and its abstractions, describes the prototype implementation, and reports experience with integrated applications. TR # 93-04-03 "Efficient Support for Multicomputing on ATM Networks" Oren Etzioni, Hank Levy, Richard Segal, and Chandramohan A. Thekkath The emergence of a new generation of networks will dramatically increase the attractiveness of loosely-coupled multicomputers based on workstation clusters. The key to achieving high performance in this enviroment is efficient network access, because the cost of remote access dictates the granularity of parallelism that can be supported. Thus in addition to traditional distribution mechanisms such as RPC, workstation clusters should support lightweight communication paradigms for executing parallel applications This paper describes a simple communication model based on the notion of remote memory access. Applications executing on one host can perform direct memory read or write operations on user-defined remote memory buffers. We have implemented a prototype system based on this model using commercially available workstations and ATM networks. Our prototype uses kernel-based emulation of remote read and write instructions, implemented through unused processor opcodes; thus, applications (or runtime libraries) see direct machine support for remote memory access. We show that this model can be supported safely and efficiently on current systems; for example, a 40-byte remote write operation completes in 30 microseconds on our 140 Mb/s ATM -based network with 19.5 SPECint DECstation hosts. Using example applications, we show how this underlying model can provide excellent performance for both parallel applications and traditional distributed system services. 
TR # 93-04-04 "OS Agents: Using AI Techniques in the Operating System Environment" Oren Etzioni, Henry Levy, Richard Segal, and Chandramohan Thekkath While recent decades have brought substantial change to the form of the operating system interface, the power of operating system commands has remained nearly constant. Conventional commands, whether visual or textual, specify one particular action to perform. To carry out a complex task, such as reducing disk utilization, the user is forced to explicitly specify each of the necessary steps. Traditional command-language extension mechanisms, such as shell scripts and pipes, enable the user to aggregate and compose various commands, but force him or her to write and debug programs --- a formidable challenge for naive users. This paper presents a goal-oriented approach to the operating system command interface, realized through an implementation we call OS agents. Using OS agents, the user simply specifies a goal to accomplish, and the OS agent decides how to accomplish that goal using its knowledge base of the system state and its commands. The OS agent dynamically synthesizes the appropriate command sequence, issues the required commands and system calls, handles errors, and retries commands if necessary. With OS agents, we have applied AI planning and learning techniques to the operating system environment to increase the power of the user's commands. We have implemented OS agents within a distributed Unix environment. Our experience indicates that it is practical to incorporate novel ideas of automatic planning and learning into contemporary operating systems with a modest amount of work and little performance penalty. In this paper we present OS agents and their operation, and describe the general-purpose mechanism we have provided to flexibly and efficiently support the needs of OS agents within the Unix operating system. TR # 93-04-05 "Vector Prefix Addition on Sub-Bus Mesh Computers" Richard E. Ladner, Jordan Lampe and Richard Rogers We investigate parallel algorithms for the sub-bus mesh computer architecture which is exemplified by the MasPar MP-1 computer. We consider the problems of vector addition (vector "reduce-add") and vector prefix addition (vector "scan-add"). Both scalar addition and scalar prefix addition can be solved in parallel time O(log p) on sub-bus mesh computers with p processors. We present a time analysis of two- dimensional algorithms. We present a time analysis of the algorithms in an abstract model of the sub-bus mesh computer. We then compare our analytical results with the timings from an actual implementation of the algorithms on a MasPar MP-1 with 1024 processors. The comparison demonstrates, to a degree, the validity of the abstract model. TR # 93-04-06 "Fast Data Breakpoints" David Keppel Debuggers allow a user to halt a program and examine its state. The debugger stops execution when some user-specified condition is satisfied: Code breakpoints halt program execution when a particular instruction is executed. Data breakpoints halt program execution when a variable is referenced. Code breakpoints are supported directly in hardware on most machines and are fast. Data breakpoints, however, are notoriously slow. This note describes how data breakpoints can be made fast. TR # 93-04-07 "An Exponential Separation between the Matching Principle and the Pigeonhole Principle" Paul Beame and Toniann Pitassi The combinatorial matching principle states that there is no perfect matching on an odd number of vertices. 
TR # 93-04-07
"An Exponential Separation between the Matching Principle and the Pigeonhole Principle"
Paul Beame and Toniann Pitassi

The combinatorial matching principle states that there is no perfect matching on an odd number of vertices. This principle generalizes the pigeonhole principle, which states that for a fixed bipartition of the vertices, there is no perfect matching between them. Therefore, it follows from recent lower bounds for the pigeonhole principle that the matching principle requires exponential-size bounded-depth Frege proofs. Ajtai [Ajt90] previously showed that the matching principle does not have polynomial-sized bounded-depth Frege proofs even with the pigeonhole principle as an axiom schema. His proof utilizes nonstandard model theory and is nonconstructive. We improve Ajtai's lower bound from barely superpolynomial to exponential and eliminate the nonstandard model theory. Our lower bound is also related to the inherent complexity of particular search classes (see [Pap91]). In particular, oracle separations between the complexity classes PPA and PPAD, and between PPA and PPP, also follow from our techniques ([BP93a]).

TR # 93-04-08
"An Automatic Verification Technique for Communicating Real-Time State Machines"
Sitaram C. V. Raju

We describe an automatic verification technique for distributed real-time systems that are specified as Communicating Real-Time State Machines (CRSMs). CRSMs are timed state machines that communicate synchronously over uni-directional channels. The proposed approach is to model the behavior of the system of (an expressive subclass of) CRSMs by a timed reachability graph. The system behavior of CRSMs is characterized by a time-stamped trace of communication events. We provide a decision procedure for verifying timing and safety properties (specified in a notation based on Real-Time Logic) of the reachability graph, and hence of the corresponding system of CRSMs. We also present a condition for the existence of deadlock in a system of CRSMs. Finally, we briefly describe an implementation of a verifier program based on the above algorithms.

TR # 93-04-09
"Building Counting Networks from Larger Balancers"
Edward W. Felten, Anthony LaMarca, and Richard Ladner

We introduce a generalization of the counting networks of Aspnes, Herlihy, and Shavit. Our counting networks are constructed using k-balancers, rather than the 2-balancers of Aspnes et al. For reasonable values of k, k-balancers and 2-balancers can be implemented with equal efficiency on existing computers. Our k-bitonic networks have depths ranging from O(1) to O(log^2 w), where w is the number of inputs to the network. The depth of our networks varies with the choice of k; choosing the optimal k depends on a tradeoff between the desire for a shallow network and the desire for low contention. We present the k-bitonic construction, prove its correctness, and introduce some useful variations to the construction. We then compare the performance of our networks with the networks of Aspnes et al., using simulation of idealized counting networks. Our networks perform at least as well in all cases, and typically have throughput about 25% higher than the networks of Aspnes et al.
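The k-balancer primitive generalized above can be specified with the small sketch below: a k-balancer forwards successive tokens to its k output wires in round-robin order, so no two output counts differ by more than one. The single shared counter here only illustrates the specification; the point of a counting network is to obtain the same balanced behavior from many small balancers with far less contention than one hot counter. The names and the use of C11 atomics are assumptions made for the sketch.

    /* Specification-level sketch of a k-balancer (k = 2 gives the balancer of
     * Aspnes et al.); a counting network composes many such balancers. */
    #include <stdatomic.h>
    #include <stdio.h>

    typedef struct {
        atomic_uint count;   /* tokens that have traversed this balancer */
        unsigned    k;       /* fan-out */
    } balancer_t;

    /* A token enters; return the index of the output wire it leaves on. */
    unsigned balancer_traverse(balancer_t *b) {
        unsigned n = atomic_fetch_add(&b->count, 1u);
        return n % b->k;
    }

    int main(void) {
        balancer_t b = { 0, 4 };                 /* a 4-balancer */
        unsigned out[4] = {0};
        for (int t = 0; t < 10; t++)
            out[balancer_traverse(&b)]++;
        for (int i = 0; i < 4; i++)
            printf("output %d received %u tokens\n", i, out[i]);   /* counts 3,3,2,2 */
        return 0;
    }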
TR # 93-05-01
"Hierarchical Constraint Logic Programming"
Molly Wilson

Many problems in computer science involve keeping track of relations. These may be relations among graphical objects on a display, arithmetic relations among numbers, information about which nodes in a graph are adjacent to each other, or requirements about job scheduling, to name just a few. These relationships can quickly grow complex as the number of objects in a system increases. Frequently, these problems can be solved more easily when these relations can be represented declaratively: when there is a language that can capture the underlying constraints on the objects without depending on how the actual relations are to be solved. The notion of a constraint language arises from this desire for abstraction. Constraints provide an elegant means of stating relationships and of declaratively characterizing properties among objects that can be maintained by an underlying system.

TR # 93-05-02
"A Periodic Object Model for Real-Time Systems"
H. R. Callison

We introduce time-sensitive objects (TSOs), a data-oriented model for real-time systems. The TSO model imposes time constraints on object values through validity intervals and object histories. Periodic objects, a class of objects within the TSO model, are described in detail and compared with more traditional periodic processes. We identify advantages of periodic objects, including greater scheduling independence, more opportunity for concurrency, and tolerance of timing faults.

TR # 93-05-03
"The Practical Application of Retiming to the Design of High-Performance Systems"
Brian Lockyear and Carl Ebeling

Many advances have been made recently in the theory of circuit retiming, especially for circuits that use level-sensitive latches. In spite of this, automatic retiming tools have seen relatively little use in practice. One reason for this is the lack of good speedup results when retiming has been applied to real circuits. Another reason is that retiming has used a rather simple circuit model which reduces its utility in practice. This paper addresses both of these issues. We suggest that the reason for the poor results reported for retiming is that retiming has been applied too late in the design process, when there is little flexibility for performance improvement. We give an example of using retiming early in the design process to achieve better performance while at the same time simplifying the design process itself. We then describe an extension to the retiming circuit model that includes clock skew as well as latch propagation delay, setup, and hold parameters. Including these parameters allows retiming to generate the fastest circuit subject to a given amount of clock skew, or to generate the most robust circuit with respect to skew for a given clock frequency. This gives level-clocked circuits yet another advantage over edge-clocked circuits, since edge-clocked circuits require margin for clock skew while level-clocked circuits can be retimed to be inherently skew-tolerant. We illustrate these techniques using a serial-parallel multiplier circuit.

TR # 93-05-04
"Minimizing the Effect of Clock Skew Via Circuit Retiming"
Brian Lockyear and Carl Ebeling

Clock skew is often cited as an impediment to designing high-performance synchronous circuits. Clock skew reduces the performance of a circuit by reducing the time available for computation, and it may even cause circuit failure by allowing race conditions. In this paper, we show how to use retiming to reduce the effect of clock skew in both edge-clocked and level-clocked circuits. We include both fixed clock skew, for example skew which results from clock distribution wiring and buffering, and variable clock skew, which results from uncontrolled variation in process parameters and operating conditions. By including fixed skew in the maximum constraint equations, retiming can find the fastest circuit given that skew. By including variable skew, retiming can also generate the circuit that tolerates the most variation in clock skew for a given clock frequency. Clock skew can also be used to speed up circuits, using a technique similar to retiming described by Fishburn. We describe a method for combining this technique, which uses added clock skew, with retiming to optimize the performance of edge-clocked circuits.
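For orientation on the two retiming reports above, the standard edge-clocked timing constraints that a skew-aware retimer must respect can be written as below, where d_max and d_min are the longest and shortest combinational delays from register u to register v, T is the clock period, and s is the worst-case skew between their clocks. This is a simplified textbook restatement (register clock-to-output delays are folded into the path delays); the reports themselves also treat level-sensitive latches, whose constraints are more involved.

    d_max(u,v) + t_setup + s <= T      (long-path, or setup, constraint: skew enters the "maximum constraint equations")
    d_min(u,v) >= t_hold + s           (short-path, or hold, constraint guarding against race conditions)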
TR # 93-05-05
"Training Compilers to Make Better Inlining Decisions"
Jeffrey Dean and Craig Chambers

Optimizing implementations for object-oriented languages rely on aggressive inlining to achieve good performance. Sometimes the compiler is over-eager in its quest for good performance, however, and inlines too many methods that merely increase compile time and consume extra compiled code space with little benefit in run-time performance. We have designed and implemented a new approach to inlining decision making in which the compiler performs inlining experimentally and records the results in a database that can be consulted to guide future inlining decisions for the same routine at call sites that have similar static information. Central to our approach is a new technique, called type group analysis, that calculates how much of the static information available at a call site was profitably used during inlining. The results of type group analysis enable the compiler to compute a generalization of the actual static information for a particular experiment, significantly increasing reuse of database entries. Preliminary results indicate that compile time is almost cut in half with only a 15% reduction in run-time performance.

TR # 93-05-07
"Asynchronous Design Methodologies: An Overview"
Scott Hauck

Asynchronous design has been an active area of research since at least the mid-1950s, but has yet to achieve widespread use. We examine the benefits and problems inherent in asynchronous computations, and in some of the more notable design methodologies. These include Huffman asynchronous circuits, burst-mode circuits, Micropipelines, template-based and Trace Theory-based delay-insensitive circuits, Signal Transition Graphs, Change Diagrams, and compilation-based quasi-delay-insensitive circuits.

TR # 93-05-08
"Pixel Arithmetic in Mathematics Education Using Image Processing"
Steve Tanimoto, James R. King, and Dennis Lee

A major challenge to mathematics educators of teenage students nowadays is motivating them to pay attention to the subject. We recently began a program of software development and testing which aims to interest 14-year-old students in mathematical activities. Our vehicle is image processing on the personal computer. A central aspect of our approach is the use of "pixel arithmetic" as a computational system with its own semantics, which is related to, but distinct from, that of real-number arithmetic. In this article, the necessity for pixel arithmetic is explained, together with insights about the pedagogical context in which it arises.
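As one concrete reading of "pixel arithmetic" (the abstract above does not fix the semantics, so the clamping convention below is an assumption made only for illustration), arithmetic on 8-bit pixels can saturate at the ends of the representable range rather than overflow as real-number or machine-integer arithmetic would.

    /* Illustrative saturating pixel arithmetic on 8-bit pixels: results are clamped
     * to 0..255, one common convention and an assumption made only for this sketch. */
    #include <stdio.h>
    #include <stdint.h>

    typedef uint8_t pixel;

    /* Saturating addition: 200 + 100 is 255, not 300 as in ordinary arithmetic. */
    static pixel pixel_add(pixel a, pixel b) {
        int sum = (int)a + (int)b;
        return (pixel)(sum > 255 ? 255 : sum);
    }

    /* Saturating subtraction: 50 - 80 is 0, not -30. */
    static pixel pixel_sub(pixel a, pixel b) {
        int diff = (int)a - (int)b;
        return (pixel)(diff < 0 ? 0 : diff);
    }

    int main(void) {
        printf("200 + 100 -> %d\n", pixel_add(200, 100));   /* prints 255 */
        printf(" 50 -  80 -> %d\n", pixel_sub(50, 80));     /* prints 0   */
        return 0;
    }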