DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIVERSITY OF WASHINGTON TECHNICAL REPORT ABSTRACTS - 94-06-01 -> 95-02-03 Obtaining Electronic Technical Reports: Technical reports in this list with an asterisk (*) following the TR number are available on-line as of April 1995 via FTP (ftp.cs.washington.edu), or the WWW (http://www.cs.washington.edu/research/tr/techreports.html) Obtaining Hardcopy Technical Reports: To obtain a technical report on paper, sent to you via Snail Mail, please contact . If you need additional information or assistance, please contact: ABSTRACTS TR # 94-06-01 * Hugues Hoppe "Surface Reconstruction from Unorganized Points" This thesis describes a general method for automatic reconstruction of accurate, concise, piecewise smooth surfaces from unorganized 3D points. Instances of surface reconstruction arise in numerous scientific and engineering applications, including reverse-engineering - the automatic generation of CAD models from physical objects. Previous surface reconstruction methods have typically required additional knowledge, such as structure in the data, known surface genus, or orientation information. In contrast, the method outlined in this thesis requires only the 3D coordinates of the data points. From the data, the method is able to automatically infer the topological type of the surface, its geometry, and the presence and location of features such as boundaries, creases, and corners. The reconstruction method has three major phases: 1) initial surface estimation, 2) mesh optimization, and 3) piecewise smooth surface optimization. A key ingredient in phase 3, and another principal contribution of this thesis, is the introduction of a new class of piecewise smooth representations based on subdivisions. The effectiveness of the three-phase reconstruction method is demonstrated on a number of examples using both simulated and real data. Phases 2 and 3 of the surface reconstruction method can also be used to approximate existing surface models. By casting surface approximation as a global optimization problem with an energy function that directly measures deviation of the approximation from the original surface, models are obtained that exhibit excellent accuracy to conciseness tradeoffs. Examples of piecewise linear and piecewise smooth approximations are generated for various surfaces, including meshes, NURBS surfaces, CSG models, and implicit surfaces. TR # 94-06-02 * "Experiences with the UWTester in Computer Science and Engineering Education" Neil R. McKenzie, Carl Ebeling, Larry McMurchie, Gaetano Borriello Teaching students the complete process of circuit design, simulation, implementation, test and debug is a daunting task. Even though design description tools and circuit compilers have kept up with the increasing levels of integration found in current implementation media such as FPGAs and microcontrollers, it has become increasingly difficult for students to test and debug the complex hardware projects enabled by these modern tools and devices. In this paper we describe an inexpensive digital circuit testing environment that greatly simplifies the test and debug experience. This environment, consisting of the tester hardware and its corresponding software, enables students to experience the challenges of testing and debugging without the expense of costly commercial hardware testers or the drudgery of setting up ad hoc hardware test jigs. 
The UWTester has been in use in the Department of Computer Science and Engineering at the University of Washington for the past two years in both undergraduate and graduate hardware design courses as well as independent study and research projects. TR # 94-06-03 * "Exploiting Shared Memory for Protected Services" Rene W. Schmidt The client/server model is commonly used to provide sharing and protection of data. Using this model, distrusted applications (clients) can access and possibly modify shared data, while guaranteeing data integrity. To ensure this data integrity, the shared data is managed by a server that is located in a private protection domain. Calls to the server domain utilize a protected procedure call mechanism, i.e., the calls execute within the server's domain. While this model provides safety, it can suffer in performance, due to the domain-crossing costs required by the protected procedure calls. "Porc" is a toolkit for building object-based client/server applications. Its main goals are (1) to provide efficient access to protected objects and (2) to hide the protection boundary from clients. In Porc, protected server objects are represented as local proxy objects in the client; the difference between a local object and a protected object is thus transparent to the client. This thesis describes three extensions of the Porc toolkit in order to provide fast access to protected objects. The extensions are all based on read-only sharing. First, read-only methods must be defined, which are server procedures that do not modify server state. Read-only methods can be directly invoked by clients without requiring a protected procedure call. Second, version-based synchronization is needed: a mechanism that can be used to efficiently synchronize read-only methods. Third, shared proxies are used to avoid the cost of local proxy creation in clients, making it faster to traverse complex pointer structures. Shared proxies also increase transparency by reducing pointer aliasing problems. TR # 94-06-04 * "Automatic Synthesis of Device Drivers for Hardware/Software Co-design" Elizabeth Walkup, Gaetano Borriello Automatically synthesizing device drivers, the hardware and software needed to interface a device to a processor, is an important element of hardware/software co-design. Driver software consists of the sequences of instructions needed for the processor to control its interactions with the device. Driver hardware consists of the digital logic necessary to physically connect the devices and generate signal events while meeting timing constraints. We describe an approach that begins with device specifications in the form of timing and state diagrams and determines which signals can be controlled directly from software and which require indirect control through intervening hardware. Minimizing this hardware requires solving a simultaneous scheduling and partitioning problem whose goal is to limit the number of wires whose events are not directly generated by the processor software. We show that even the simplest version of this problem is NP-hard and discuss a heuristic solution that should work well in practical situations. TR # 94-06-05 * "Lecture Notes on Message Routing in Parallel Machines" Martin Tompa These are the lecture notes from CSE 522, a graduate algorithms course I taught at the University of Washington in Spring 1994. The topic of the course was Message Routing in Parallel Machines.
These notes are not intended to be a survey of that area, however, as there are numerous important results that I would have liked to cover but did not have time. TR # 94-06-06 * "The Meerkat Multicomputer: Tradeoffs in Multicomputer Architecture" Robert C. Bedichek A central problem preventing the wide application of distributed memory multicomputers has been their high price, especially for small installations. High prices are due to long design times, support for scaling to thousands of nodes, and high production costs. This thesis demonstrates a new approach that combines some carefully chosen restrictions on scaling with a software-intensive methodology. We used a concrete design, Meerkat, as a vehicle for exploring the multicomputer design space. Meerkat is a distributed memory multicomputer architecture effective in the range from several to hundreds of processors. It uses a two-dimensional, passive backplane to connect nodes composed of processors, memory, and I/O devices. The interconnect is conceptually simple and inexpensive to design and build, has low latency, and provides high bandwidth on long messages. However, it does not scale to thousands of processors, provide high bandwidth on short messages, or implement cache coherent shared memory. This thesis describes Meerkat's architecture, the niche that Meerkat fills, and the rationale for our design choices. We present performance results obtained from our hardware prototype and a calibrated simulator to demonstrate that parallel numerical workloads work well on Meerkat. TR # 94-07-02 * "System Support for Efficient Network Communication" Chandramohan A. Thekkath Recent advances in processor and network technology have significantly changed the hardware base for distributed systems. However, high-speed networks and fast processors alone are not sufficient to deliver the end-to-end performance that users of these systems expect; good software design is critical. This thesis explores software mechanisms needed to support efficient network access for the next generation of workstations and networks. The thesis proposes a new communication mechanism or model based on remote access (read and write) to protected memory segments. This model has two key aspects. First, it simplifies input data demultiplexing and reduces data copying costs. This allows users of the model very efficient access to remote data. A second and more significant feature of the model is that it separates the transfer of data from the transfer of control. That is, a remote data access can complete at the target process without invoking a thread of control within the target. The communication model is implemented on DECstation workstations connected by an ATM network and provides performance close to the limits imposed by the hardware. Using example applications, we show how the model can provide excellent performance for both parallel applications and conventional distributed system mechanisms like remote procedure call (RPC). However, we argue that conventional communication mechanisms, e.g., message passing or RPC, do not fully exploit the potential of newer networks. These mechanisms closely couple the transfer of both control and data. We demonstrate, using the read/write communication model, that separating data transfer and control transfer in distributed systems can both (1) eliminate unnecessary and expensive control transfers and (2) facilitate tighter coupling of a distributed system.
This has the potential to increase performance and reduce server load, which supports scaling in the face of an increasing number of clients. TR # 94-07-03 * "Minimal Adaptive Routing on the Mesh with Bounded Queue Size" Donald D. Chinn, Tom Leighton, Martin Tompa An adaptive routing algorithm is one in which the path a packet takes from its source to its destination may depend on other packets it encounters. Such algorithms potentially avoid network bottlenecks by routing packets around "hot spots." Minimal adaptive routing algorithms have the additional advantage that the path each packet takes is a shortest one. For a large class of minimal adaptive routing algorithms, we present an Omega(n^2/k^2) bound on the worst case time to route a static permutation of packets on an n x n mesh or torus with nodes that can hold up to k >= 1 packets each. This is the first nontrivial lower bound on adaptive routing algorithms. The argument extends to more general routing problems, such as the h-h routing problem. It also extends to a large class of dimension order routing algorithms, yielding an Omega(n^2/k^2) time bound. To complement these lower bounds, we present two upper bounds. One is an O(n^2/k^2) time dimension order routing algorithm that matches the lower bound. The other is the first instance of a minimal adaptive routing algorithm that achieves O(n) time with constant sized queues per node. We point out why the latter algorithm is outside the model of our lower bounds. TR # 94-07-04 * "Separating Data and Control Transfer in Distributed Operating Systems" Chandramohan A. Thekkath, Henry M. Levy, Edward D. Lazowska Advances in processor architecture and technology have resulted in workstations in the 100+ MIPS range. As well, newer local-area networks such as ATM promise a ten- to hundred-fold increase in throughput, much reduced latency, greater scalability, and greatly increased reliability, when compared to current LANs such as Ethernet. We believe that these new network and processor technologies will permit tighter coupling of distributed systems at the hardware level, and that distributed systems software should be designed to benefit from that tighter coupling. In this paper, we propose an alternative way of structuring distributed systems that takes advantage of a communication model based on remote network access (reads and writes) to protected memory segments. A key feature of the new structure, directly supported by the communication model, is the separation of "data transfer" and "control transfer." This is in contrast to the structure of traditional distributed systems, which are typically organized using message passing or remote procedure call (RPC). In RPC-style systems, data and control are inextricably linked -- all RPCs must transfer both data and control, even if the control transfer is unnecessary. We have implemented our model on DECstation hardware connected by an ATM network. We demonstrate how separating data transfer and control transfer can eliminate unnecessary control transfers and facilitate tighter coupling of the client and server. This has the potential to increase performance and reduce server load, which supports scaling in the face of an increasing number of clients. For example, for a small set of file server operations, our analysis shows a 50% decrease in server load when we switched from a communications mechanism requiring both control transfer and data transfer, to an alternative structure based on pure data transfer.
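The separation of data transfer from control transfer described in TR # 94-07-02 and TR # 94-07-04 can be made concrete with a small sketch. The C++ fragment below appends records to a log kept in a memory segment exported by a server; the RemoteSegment class, its interface, and the log layout are invented for this illustration and are not the interface defined in the reports, and the segment is simulated with a local buffer so the example is self-contained. The point is that each append is a pure data transfer: in the real model the bytes move over the network into the server's protected segment without scheduling a server thread, whereas an RPC-based design would transfer control to the server on every operation.

#include <cstdint>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

// Local stand-in for a protected memory segment exported by a server.  In the
// model described above, read() and write() would move bytes across the ATM
// network into the server's segment without dispatching a server thread; here
// a plain buffer is used so the sketch is self-contained.  All names and the
// log layout are assumptions made for this illustration.
class RemoteSegment {
public:
    explicit RemoteSegment(std::size_t bytes) : mem_(bytes, 0) {}
    void write(std::size_t off, const void* src, std::size_t len) {
        std::memcpy(mem_.data() + off, src, len);   // pure data transfer
    }
    void read(std::size_t off, void* dst, std::size_t len) const {
        std::memcpy(dst, mem_.data() + off, len);
    }
private:
    std::vector<std::uint8_t> mem_;
};

// Client-side append to a log kept in the server's segment.  An RPC-based
// design would transfer control to the server on every append; here the
// server runs only when it later chooses to scan the log.
class SegmentLog {
public:
    explicit SegmentLog(RemoteSegment& seg) : seg_(seg) {}
    void append(const std::string& record) {
        std::uint64_t tail = 0;
        seg_.read(0, &tail, sizeof(tail));                       // current end of log
        seg_.write(kData + tail, record.data(), record.size());  // record bytes
        tail += record.size();
        seg_.write(0, &tail, sizeof(tail));                      // publish new end
        // A real client would rely on the model's synchronization support here;
        // this sketch ignores concurrent writers.
    }
private:
    static constexpr std::size_t kData = 64;   // records start past the header
    RemoteSegment& seg_;
};

int main() {
    RemoteSegment segment(4096);
    SegmentLog log(segment);
    log.append("open file");      // no control transfer to the server
    log.append("write block");
    std::uint64_t tail = 0;
    segment.read(0, &tail, sizeof(tail));
    std::cout << "log bytes written: " << tail << "\n";
}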
TR # 94-07-05 * "Hardware and Software Support for Efficient Exception Handling" Chandramohan A. Thekkath and Henry M. Levy Program-synchronous exceptions, for example, breakpoints, watchpoints, illegal opcodes, and memory access violations, provide information about "exceptional" conditions, interrupting the program and vectoring to an operating system handler. Over the last decade, however, programs and run-time systems have increasingly employed these mechanisms as a performance optimization to detect "normal" and "expected" conditions. Unfortunately, current architecture and operating system structures are designed for exceptional or erroneous conditions, where performance is of secondary importance, rather than normal conditions. Consequently, this has limited the practicality of such hardware-based detection mechanisms. We propose both hardware and software structures that permit efficient handling of synchronous exceptions by user-level code. We demonstrate a software implementation that reduces exception-delivery cost by an order of magnitude on current RISC processors, and show the performance benefits of that mechanism for several example applications. TR # 94-07-06 "Self-Explanatory Simulation for an Electronic Encyclopedia" Franz G. Amador This dissertation presents a new, interpretive architecture for self-explanatory simulation that is designed to support highly interactive, simulation-based electronic encyclopedia articles. Unlike previous compilation-based architectures, this interpretive architecture can rapidly respond to external modifications to the simulated model, such as a user's disconnecting the components of a simulated refrigerator while interacting with an encyclopedia article on refrigeration. This architecture is fully implemented as the Pika self-explanatory simulator, which prepares for simulation the most complex reported test example of Forbus and Falkenhainer's SIMGEN .Mk2, the state-of-the-art self-explanatory simulator compiler, about 2000 times faster than does SIMGEN .Mk2. Once simulation has begun, Pika employs an incremental update strategy that, for a realistic thermodynamic model, reduces the time needed to respond to changes in the model, internal or external, by a factor of eight over the initial simulation-preparation time. Unlike SIMGEN, Pika's modeling language allows non-directional algebraic and differential equations. Through novel use of a constraint hierarchy, Pika identifies state variables, chooses default quantity values, decides how to use each equation, and symbolically solves equations as necessary to prepare for numeric integration. TR # 94-07-07 * "Implementing Constraint Imperative Programming Languages: the Kaleidoscope'93 Virtual Machine" Gus Lopez, Bjorn Freeman-Benson, and Alan Borning Constraint Imperative Programming (CIP) languages integrate declarative constraints with imperative state and destructive assignment, yielding a powerful new programming paradigm. However, CIP languages are difficult to implement efficiently due to complex interactions between the two donor paradigms. Neither the virtual machines for classical object-oriented languages, nor those for existing constraint languages, are suitable for implementing CIP languages, as each assumes a purely imperative or a purely declarative computation model. We have developed a new virtual machine for CIP languages, the K-machine, an imperative machine with an incremental constraint solver and a constraint-based, rather than value-based, data store.
This virtual machine allows user-defined constraints to be defined using constraint constructor definitions, which are the CIP analog to method definitions. Similar to methods, these constructors are able to reference variables indirectly through many levels of pointers. The K-machine maintains relations between objects in the presence of state change to these indirectly referenced objects. The K-machine is capable of supporting a wide variety of CIP languages, including our most recent: Kaleidoscope'93. TR # 94-08-01 * "Mediators: Easing the Design and Evolution of Integrated Systems" Kevin J. Sullivan People benefit from tightly integrated systems that can be designed, realized, and evolved affordably. Unfortunately, common software design methods do not easily accommodate requirements for tightly integrated systems. Indeed, when used to meet such requirements, common methods tend to yield unnecessarily complex structures that complicate design and realization and that inhibit subsequent evolution. After substantiating this claim, I present the "mediator method" as a solution. This method combines "behavioral entity-relationship (ER) modeling" for design with a "mediator" approach to implementation. The mediator method is better than common methods for designing, realizing, and evolving many integrated systems. I support this claim both by rational argument from simplifying models and by carefully reflecting on software development experiences in which the method was used. One of these efforts led to the design of Prism, a system used to plan radiation treatments for cancer patients, now in clinical use at the University of Washington Cancer Center. I present Prism as a case study on the use of the mediator design method. The mediator method represents a contribution of significant value because it allows ordinary software developers to design, realize, and evolve more effective integrated systems more affordably with common programming languages and without costly or restrictive new mechanisms. TR # 94-09-01 * "The PRESTO Application Suite" Radhika Thekkath and Susan J. Eggers This report describes a group of coarse- and medium-grain, explicitly-parallel applications that have been made available via the World-Wide Web. These programs have been written using the PRESTO user-level threads library by students at the University of Washington and Rice University. PRESTO provides a C++ based environment for writing object-oriented parallel programs for shared-memory multiprocessors. The library provides basic classes useful for writing parallel programs, among them thread manipulation routines for concurrency and synchronization primitives. TR # 94-09-02 * "Optimizing Static Calendar Queues" K. Bruce Erickson, Richard E. Ladner, Anthony LaMarca The calendar queue is an important implementation of a priority queue which is particularly useful in discrete event simulators. In this paper we present an analysis of the static calendar queue which maintains N active events. A step of the discrete event simulator removes and processes the event with the smallest associated time and inserts a new event whose associated time is the time of the removed event plus a random increment with mean "mu." We demonstrate that for the infinite bucket calendar queue the optimal bucket width is approximately delta_{opt} = sqrt(2b/c) (mu/N) where b is the time to process an empty bucket and c the incremental time to process a list element.
With bucket width chosen to be delta_{opt}, the expected time to process an event is approximately minimized at the constant c + sqrt(2bc) + d, where d is the fixed time to process an event. We show that choosing the number of buckets to be O(N) yields a calendar queue with performance equal to or almost equal to the performance of the infinite bucket calendar queue. TR # 94-09-03 * "Reflecting Source Code Relations in Higher-Level Models of Software Systems" Gail C. Murphy, David Notkin, and Kevin Sullivan Reasoning about an existing software system in terms of high-level models - such as entity-relationship diagrams, data-flow diagrams, object communication diagrams, etc. - is both necessary and dangerous. The necessity arises because the source code for a system is so complex that no individual can fully understand it. The danger arises from the risk that a software engineer may make an inappropriate decision due to discrepancies between the source and a high-level model. We present a formal model, algorithm and tool that reduce the risk of reasoning about the system in terms of a high-level model when performing a software engineering task. Given a high-level model posited by an engineer, a low-level model of the source (e.g., a call graph), and a mapping between the two, we compute and report both the consistencies and inconsistencies between the high-level and source models. Reported consistencies increase the engineer's belief that reasoning in terms of the high-level model is trustworthy. Inconsistencies warn the engineer to consider carefully whether additional analysis is needed before continuing with the planned activity. TR # 94-09-04 * "Scheduling Issues in the Co-Synthesis of Reactive Real-Time Systems" Pai Chou, Elizabeth Walkup, Gaetano Borriello Many embedded control applications must respect intricate timing requirements on their interactions with the external environment. These constraints are derived from response time, rate of execution, and low-level signaling requirements. Currently, most of these systems are being designed in an ad hoc manner. Many tools assume the designer has already finalized the scheduling, while most schedulers make simplifying assumptions and often cannot handle general timing constraints. In this paper, we discuss the scheduling issues that must be addressed by co-synthesis tools for embedded systems and outline possible approaches to the problems. Our perspective is based on experience with Chinook, a hardware-software co-synthesis system for reactive real-time systems, currently under development at the University of Washington. Chinook is initially targeting embedded applications without operating system support. From a high-level specification and a device library, Chinook synthesizes both interface hardware and a software program to realize the design. TR # 94-09-05 * "Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations" Tor Jeremiassen and Susan Eggers We have developed compiler algorithms that analyze coarse-grained, explicitly parallel programs and restructure their shared data to minimize the number of false sharing misses. The algorithms analyze the per-process data accesses to shared data, use this information to pinpoint the data structures that are prone to false sharing, and choose an appropriate transformation to reduce it. The algorithms eliminated an average (across the entire workload) of 64% of false sharing misses, and in two programs more than 90%.
However, how well the reduction in false sharing misses translated into improved execution time depended heavily on the memory subsystem architecture and previous programmer efforts to optimize for locality. On a multiprocessor with a large cache configuration and high cache miss penalty, the transformations improved the execution time of programmer-unoptimized applications by as much as 60%. However, on programs where previous programmer efforts to improve data locality had reduced the original amount of false sharing, and on a multiprocessor with a small cache configuration and cache miss penalty, the gains were more modest. TR # 94-09-07 * "A Framework for Selective Recompilation in the Presence of Complex Intermodule Dependencies" Compilers and other programming environment tools derive information from the source code of programs; derived information includes compiled code, interprocedural summary information, and call graph views. If the source program changes, the derived information needs to be updated. We present a simple framework for maintaining intermodule dependencies, embodying different tradeoffs in terms of space usage, speed of processing, and selectivity of invalidation, that eases the implementation of incremental update of derived information. Our framework augments a directed acyclic graph representation of dependencies with factoring nodes (to save space) and filtering nodes (to increase selectivity), and it includes an algorithm for efficient invalidation processing. We show how several schemes for selective recompilation, such as smart recompilation, filter sets for interprocedural summary information, and dependencies for whole-program optimization of object-oriented languages, map naturally onto our framework. For this latter application, by exploiting the facilities of our framework, we are able to reduce the number of lines of source code recompiled by a factor of seven over a header file-based scheme, and by a factor of two over the previous state-of-the-art selective dependency mechanism without consuming additional space. TR # 94-09-08 "Efficient Replication Management in Distributed Systems" Michael Rabinovich Replication is a critical aspect of large-scale distributed systems. Without it, the size of a system is limited by factors such as the risk of component failures, the overloading of popular services, and access latency to remote parts of the system. Replication overcomes these problems by allowing service to continue despite failures using remaining replicas, and by distributing requests for a given service among multiple server replicas. However, replicated systems incur significant performance overhead for maintaining multiple replicas and keeping them mutually consistent. Managing replication efficiently is therefore important in building large-scale distributed systems. This dissertation concentrates on quorum-based replication management. It proposes several ways to manage replication efficiently using different types of quorums. First, we study structure-based quorums, which are attractive because they result in low-overhead replica management when the number of replicas is high. We propose a way to significantly improve system availability in protocols using these quorums. We also study in depth the performance of a particular class of these quorums based on a grid structure. 
Second, for voting-based quorums, we propose a way to include new or repaired replicas into the system and exclude failed, disconnected, or deactivated replicas, asynchronously with user operations. Thus, such reconfiguration does not involve any service interruption and can be done more freely for various purposes, e.g., improving data availability, migrating data, or redistributing load. Third, we propose a way to efficiently manage read-one-write-all (ROWA) replication. Protocols based on ROWA quorums maintain consistency by reading one copy of the data and writing all copies. They are especially attractive because they provide for highly efficient and fault-tolerant read operation, and because many services and data in distributed systems fall under the mostly-read category. The obvious weakness of ROWA, which has prevented it from being widely used in commercial wide-area networks, is lack of fault-tolerance for writes. This thesis proposes a new ROWA protocol that provides fault-tolerant write operations without any significant deterioration of read properties. This protocol offers immediate benefits to current systems. Finally, the thesis studies some issues that are common for all types of quorums, including semantics of write operations and efficient protocols for transaction commitment. TR # 94-09-09 * "Architectural Support for Compiler-Generated Data-Parallel Programs" Alexander C. Klaiber To fully realize the advantages of parallel processing demands the design of efficient communication mechanisms. Existing communication architectures span a spectrum ranging from message passing to remote-memory access, shared memory and cache-only architectures. These communication architectures are often used (and designed to be used) directly by the programmer. However, in the future we can expect more programs to be written in high-level parallel languages and compiled to the specific parallel target; the compiler will hide the details of the underlying hardware from the programmer. The communication architecture should then be designed with the compiler, not the programmer, in mind. The goal of our work is to improve communication performance for programs that are written in a high-level parallel language and then compiled to a specific communication architecture. To make this task manageable, we focus on the class of data-parallel languages and we pick C* as one representative for our experiments. We evaluate three competing communication architectures - message-passing, remote memory access, and cache-coherent shared-memory -- for a set of benchmarks written in C* and compiled to the respective architecture. We show that the message-passing model has several inherent advantages for these benchmarks, resulting in less interconnect traffic and less time spent waiting for messages to traverse the interconnect. On the other hand, the message-passing architecture requires the CPU to perform significantly more work per message than the other architectures. This communication overhead destroys much of the message passing model's advantage. We propose a language-oriented communication architecture that retains the advantages of the message-passing model, yet (in cooperation with the compiler) significantly reduces the communication overhead. To do so, we first identify a small set of low-level communication and synchronization primitives that are well matched to the needs of C*. 
We then design a network interface that is tuned to these primitives and describe the C* compilation for this base; our network interface includes hardware for remote read/write requests plus counter-based synchronization support. We simulate and measure our compiled C* benchmarks on a traditional message-passing interface as well as our language-oriented design; our measurements demonstrate that our design is effective at reducing communication-related CPU overhead. TR # 94-09-10 "Constraint Satisfaction and Debugging for Interactive User Interfaces" Michael Sannella Many user interface toolkits use constraint solvers to maintain geometric relationships between graphic objects, or to connect the graphics to the application data structures. One efficient and flexible technique for maintaining constraints is multi-way local propagation, where constraints are represented by sets of method procedures. This dissertation examines the use of local propagation constraint solvers in user interface toolkits, and presents three new systems: (1) The SkyBlue constraint solver. SkyBlue is an incremental constraint solver that uses local propagation to maintain a set of constraints as individual constraints are added and removed. If all of the constraints cannot be satisfied, SkyBlue leaves weaker constraints unsatisfied in order to satisfy stronger constraints (maintaining a constraint hierarchy). SkyBlue is a more general successor to the DeltaBlue algorithm that satisfies cycles of methods by calling external cycle solvers and supports multi-output methods. (2) The Multi-Garnet user interface development system. Garnet is a user interface toolkit that incorporates a constraint solver to maintain one-way constraints between object fields. The Multi-Garnet package extends Garnet to support multi-way constraints and constraint hierarchies. Multi-Garnet has been used to construct complex user interfaces that would have been difficult to construct just using Garnet's constraint solver. (3) The CNV user interface builder and debugger. CNV includes a set of debugging tools to help programmers understand the behavior of complex constraint networks. One tool uses a new algorithm to generate alternate constraint solutions that would have been produced if SkyBlue had chosen different methods to satisfy the constraints. TR # 94-09-11 * "Wavelets for Computer Graphics: A Primer" Eric J. Stollnitz, Tony D. DeRose, David H. Salesin Wavelets are a mathematical tool for hierarchically decomposing functions. Wavelets allow any function to be described in terms of a coarse overall shape, plus details that range from broad to narrow. Regardless of whether the function of interest is an image, a curve, or a surface, wavelets provide an elegant technique for representing the levels of detail present. This primer is intended to provide those working in computer graphics with some intuition for what wavelets are, as well as to present the mathematical foundations necessary for studying and using them. We include as examples two applications of wavelets: image compression, and multiresolution editing of curves and surfaces. TR # 94-09-12 * "Integrating Coherency and Recoverability in Distributed Systems" Michael J. Feeley, Jeffrey S. Chase, Vivek R. Narasayya, and Henry M. Levy We propose a technique for maintaining coherency of a "transactional" distributed shared memory, used by applications accessing a shared persistent store.
Our goal is to improve support for fine-grained distributed data sharing in collaborative design applications, such as CAD systems and software development environments. In contrast, traditional research in distributed shared memory has focused on supporting parallel programs; in this paper, we show how distributed programs can benefit from this shared-memory abstraction as well. Our approach, called log-based coherency, integrates coherency support with a standard mechanism for ensuring recoverability of persistent data. In our system, transaction logs are the basis of both recoverability and coherency. We have prototyped log-based coherency as a set of extensions to RVM (Satyanarayanan et al. '94), a runtime package supporting recoverable virtual memory. Our prototype adds coherency support to RVM in a simple way that does not require changes to existing RVM applications. We report on our prototype and its performance, and discuss its relationship to other DSM systems. TR # 94-10-01 * "Global Illumination of Glossy Environments using Wavelets and Importance" Per Christensen, Eric Stollnitz, David Salesin, Tony DeRose We show how importance-driven refinement and a wavelet basis can be combined to provide an efficient solution to the global illumination problem with glossy and diffuse reflections. Importance is used to focus the computation on the interactions having the greatest impact on the visible solution. Wavelets are used to provide an efficient representation of radiance, importance, and the transport operator. We discuss a number of choices that must be made when constructing an algorithm for a finite-element solution method for glossy global illumination. The main contributions of this paper are a four-dimensional wavelet representation for spatially- and angularly-varying radiance distributions (which allows for sparse representation and fast evaluation), the use of the standard wavelet decomposition of the transport operator, and the formulation of a type of importance suited for light transport using a finite element radiance representation. Our implementation supports curved surfaces and spatially-varying anisotropic BRDFs. We use a final gathering step to improve the visual quality of the solution. TR # 94-10-02 "N.W. Laboratory for Integrated Systems Semiannual Technical Report #6" Gaetano Borriello, Steven M. Burns, Carl Ebeling, Lawrence Snyder The goal of our work is to develop synthesis tools for high-performance custom VLSI and embedded systems. Our focus is on applications in real-time control and communications. The major elements of our research include: - Specification methods for timing constraints and delay information. - Timing analysis and verification algorithms. - Timing-driven synthesis techniques for hardware and software. - Tools for the analysis of interfaces and synthesis of glue logic and software drivers. - Infrastructure for the rapid-prototyping of board-level designs. - Implementation of real systems to ground the CAD work. TR # 94-10-03 * "Optimistic Trace-driven Simulation" Xiaohan Qin and Jean-Loup Baer Parallel simulation of multiprocessor architectures is a promising direction because a parallel system provides the high computation and storage capabilities that are required by detailed architectural simulation. Additionally, the behavior of the target system exhibits natural parallelism. In this paper, we consider the evaluation of the memory hierarchy of multiprocessor systems via parallel trace-driven simulation.
We present a Time Warp-like parallel trace-driven simulation algorithm and data structures. The overhead components of the optimistic algorithm as well as a number of optimization strategies are discussed. The performance of the optimistic parallel simulator as implemented on a KSR-2 is reported. Our results show that significant speedup can be achieved for applications which have good data locality. As the amount of shared reference misses increases, the frequent communication and synchronization overhead inherent in the applications also limits the speedup of the parallel trace-driven simulation. TR # 94-10-04 * "A Comparative Study of Conservative and Optimistic Trace-driven Simulation" Xiaohan Qin and Jean-Loup Baer In this paper, we consider the evaluation of the memory hierarchy of multiprocessor systems via parallel trace-driven simulation. We study two parallel simulation schemes: a conservative one using an algorithm proposed by Lin et al., whose main characteristic is to insert the shared references from every trace in all other traces, and an optimistic one using a Time Warp-like algorithm. We compare, qualitatively and quantitatively, the major causes of overhead and the overall performance of the two methods. In addition, we discuss the trade-offs in terms of implementation and debugging effort and of application to more general architectural simulation. The optimistic scheme is more complex but, in general, has slightly better performance, is more general, and does not require preprocessing. TR # 94-10-05 * "Scheduling Memory Constrained Jobs on Distributed Memory Parallel Computers" Cathy McCann and John Zahorjan While the parallel use of many processors is a major attraction of scalable multiprocessors for large applications, another important feature of such machines is the large amount of physical memory they make available. Despite these resources, truly large applications may be limited not only by the amount of computation they require, but also by the amount of data they must operate on to solve problems of interest. In this paper, we consider the problem of multiprocessor scheduling of jobs whose memory requirements place lower bounds on the fraction of the machine required in order to execute. We address three primary questions in this work: 1. How can a parallel machine be multiprogrammed with minimal overhead when jobs have minimum memory requirements? 2. To what extent does the inability of an application to repartition its workload during runtime affect the choice of processor allocation policy? 3. How rigid should the system be in attempting to provide equal resource allocation to each runnable job in order to minimize average response time? TR # 94-10-06 * "ZPL Language Reference Manual" Calvin Lin This document presents a complete description of the ZPL language. The language is presented without motivation in a bottom-up fashion, starting with lexical elements and then building upon these to introduce the ZPL data types, variables, expressions, and statements. A table of contents and an extensive index are found at the end. ZPL is a data parallel language that allows arrays and subarrays to be manipulated as whole entities. The language provides constructs that eliminate tedious and error-prone array indexing. The language's conciseness is illustrated by the SIMPLE computational fluid dynamics benchmark, which is approximately 5000 lines when coded in C plus message passing, but only 500 lines and considerably more readable when coded in ZPL.
ZPL programs have sequential semantics and control flow, which considerably simplifies the program development and debugging process. A major goal of ZPL is to provide portability and performance across diverse parallel computers. As such, ZPL is built upon the Phase Abstractions parallel programming model. TR # 94-11-01 * "Optimal One-Way Sorting on a One-Dimensional Sub-Bus Array" James D. Fix and Richard E. Ladner The problem of sorting on a one-dimensional sub-bus array of the processors is addressed. A one-dimensional sub-bus array has a bus connecting the processors which can be segmented into sub-buses on which an active processor can broadcast. The sub-bus broadcast capability is implemented on the MasPar MP-1 and MP-2 parallel computers. The sub-bus broadcast operation makes possible a new class of parallel sorting algorithms which can be characterized by "parallel insertion steps." We restrict our attention to the class of sorting methods where parallel insertion steps are in the same direction, say left. For this class of "one-way sorting strategies" we prove a lower bound and give two strategies, both of which achieve the lower bound. Because our parallel insertion model is quite general, it is not necessarily the case that a sorting strategy in the model can be efficiently implemented on a real sub-bus array of processors. The left greedy sort cannot be efficiently implemented, but the left adaptive insertion sort can be efficiently implemented. The two sorting strategies have different properties and each is interesting in its own right. TR # 94-11-02 "Wirec 3.2 Tutorial and Reference Manual" Larry McMurchie and Carl Ebeling Wirec is a system for describing the structure of electronic systems. It allows the designer to use either procedural constructs or schematic symbols and wires to describe circuit structure. The procedural constructs, in this case C++, give the power and flexibility to describe structures algorithmically. Schematic symbols and wires have the advantage of being graphical and therefore intuitive means of representing circuits. Wirec allows the designer to embed C++ code in schematics and schematics in C++ code, and therefore mix the two modes of representation at a very fine grain. TR # 94-11-03 * "A Characterization of the `Dynamic Markov Compression' FSM with Finite Conditioning Contexts" Suzanne Bunton The well-known but poorly understood "Dynamic Markov Compression" (DMC) algorithm is rigorously analyzed. DMC is one of many sophisticated on-line stochastic data models from the literature that may be combined with arithmetic coding to perform lossless data compression. We define and prove correct the first finite-context characterization of DMC model states. Our minimal characterization proves that DMC's ergodic, unifilar, finite-order Markov models do not belong to the useful class of FSMX sources. Nonetheless, our characterization enables meaningful comparison of DMC models with the many popular FSMX models in the literature in terms of model structure and a priori statistical assumptions. Lastly, the characterization exposes principled solutions for improving DMC's already competitive performance, and for reducing DMC's excessive memory consumption, which prevents DMC models from being used in practice. 
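For readers unfamiliar with DMC, the following C++ sketch shows the model-update step (per-state transition counts plus state cloning) as it is usually presented in the literature following Cormack and Horspool. It is included only to make the notion of DMC model states concrete; the thresholds, the one-state initial machine, and all names are illustrative assumptions rather than details taken from the report.

#include <cstdio>
#include <vector>

// Rough sketch of the Dynamic Markov Compression model update, in the spirit
// of Cormack and Horspool's original description.  Thresholds, the trivial
// initial machine, and all names are illustrative, not taken from TR 94-11-03.
struct DmcModel {
    struct State {
        double count[2];   // observed 0- and 1-transitions leaving this state
        int    next[2];    // successor state for each bit value
    };

    std::vector<State> states;
    int cur = 0;

    DmcModel() {
        // Degenerate initial machine: one state looping to itself.
        states.push_back(State{{0.2, 0.2}, {0, 0}});
    }

    // Probability estimate that the next bit is a 1, from the current state.
    double p1() const {
        const State& s = states[cur];
        return s.count[1] / (s.count[0] + s.count[1]);
    }

    // Update the model after observing bit b (0 or 1).
    void update(int b) {
        const double kMinFrom = 2.0, kMinOther = 2.0;            // cloning thresholds
        int t = states[cur].next[b];
        double from = states[cur].count[b];                      // traffic on this transition
        double into = states[t].count[0] + states[t].count[1];   // ~ total visits to t
        if (from > kMinFrom && into - from > kMinOther) {
            // Clone t: the new state is reached only via this transition, and
            // t's counts are split in proportion to the traffic each copy keeps.
            State clone = states[t];
            double ratio = from / into;
            for (int x = 0; x < 2; ++x) {
                clone.count[x]      = states[t].count[x] * ratio;
                states[t].count[x] -= clone.count[x];
            }
            int tnew = static_cast<int>(states.size());
            states.push_back(clone);
            states[cur].next[b] = tnew;
            t = tnew;
        }
        states[cur].count[b] += 1.0;
        cur = t;
    }
};

int main() {
    DmcModel m;
    const int bits[] = {1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1};
    for (int b : bits) {
        std::printf("p(next=1) = %.3f  states = %zu\n", m.p1(), m.states.size());
        m.update(b);
    }
}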
TR # 94-11-07 "Determination of DRAM Operating Limits with Applications in Self-Tuning" Chris Fisher, Jean-Paul Tea, Sean Mirkes, Steve Rabin, Ted Kehl Dynamic random access memory is commonly used for storing large amounts of data due to its relatively inexpensive price and its small space requirements. The purpose of this research is to investigate the possibilities for enhancing the performance of DRAM. This includes determining the minimal timing for RAS and CAS, investigating data and address dependency, and determining the amount of jitter in DRAM circuitry. TR # 94-11-08 * "Wit: An Infrastructure for Wireless Palmtop Computing" Terri Watson Wit is a system infrastructure that supports wireless connection of palmtop computers to a wired local area network (LAN). The system includes a primitive windowing system, non-preemptive threads, reliable and unreliable connections, and a Tcl interpreter which allows applications to dynamically extend the palmtop system and user interface by downloading sequences of Tcl code. Together, these services provide an environment for exploring issues in mobile computing for small devices by supporting a wide range of solutions for hiding low bandwidth constraints, and permitting disconnected operation through local execution of application functions. TR # 94-12-01 * "Optimization of Object-Oriented Programs Using Static Class Hierarchy Analysis" Jeffrey Dean, David Grove, and Craig Chambers Optimizing compilers for object-oriented languages apply static analysis and other techniques to try to deduce precise information about the possible classes of the receivers of messages; if successful, dynamically-dispatched messages can be replaced with direct procedure calls and potentially further optimized through inline-expansion. By examining the complete inheritance graph of a program, which we call "class hierarchy analysis," the compiler can improve the quality of static class information and thereby improve run-time performance. In this paper we present class hierarchy analysis and describe techniques for implementing this analysis effectively in both statically- and dynamically-typed languages and also in the presence of multi-methods. We also discuss how class hierarchy analysis can be supported in an interactive programming environment and, to some extent, in the presence of separate compilation. Finally, we assess the bottom-line performance improvement due to class hierarchy analysis alone and in combination with two other "competing" optimizations, profile-guided receiver class prediction and method specialization. TR # 94-12-02 "A ZPL Programming Guide" Lawrence Snyder This guide seeks to be a complete introduction to the ZPL programming language, assuming the reader is experienced with some imperative language such as C, Fortran, Pascal, or the like. Though precise and in most instances thorough, the guide does not attempt to be a reference manual for ZPL. Rather, it illustrates typical ZPL usage and explains in an intuitive way how the constructs work. Emphasis is placed on teaching the reader to be a ZPL programmer. Scientific computations are used as examples throughout. Computer scientists familiar with the area of programming high performance parallel computers and interested in an accelerated route through the guide are advised to study Fig. 1, skim "A Quick Tour," read "ZPL Array Concepts" and "Parallel Execution - Two Important Details."
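Returning to class hierarchy analysis (TR # 94-12-01 above), the following C++ sketch illustrates the basic reasoning: given the complete inheritance graph, a dynamically dispatched call can be replaced by a direct call whenever every class at or below the receiver's static type resolves the message to the same method definition. The data structures and names here are invented for illustration and are not those used in the report.

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy whole-program view of a class hierarchy.  Names and structures are
// illustrative only, not those used in TR 94-12-01.
struct Hierarchy {
    std::map<std::string, std::string>              parent;    // class -> superclass ("" = root)
    std::map<std::string, std::vector<std::string>> children;  // class -> direct subclasses
    std::map<std::string, std::set<std::string>>    defines;   // method -> classes defining it

    void addClass(const std::string& cls, const std::string& super) {
        parent[cls] = super;
        if (!super.empty()) children[super].push_back(cls);
    }

    // Method lookup: walk up from 'cls' to find the class whose definition
    // of 'method' would actually run at run time.
    std::string lookup(std::string cls, const std::string& method) const {
        while (!cls.empty()) {
            if (defines.count(method) && defines.at(method).count(cls)) return cls;
            cls = parent.at(cls);
        }
        return "";  // no applicable definition
    }

    // Class hierarchy analysis for one call site: a send of 'method' to a
    // receiver statically typed 'cls' can become a direct call iff every
    // class in the receiver's cone (cls plus all transitive subclasses)
    // resolves the method to the same definition.
    bool canDevirtualize(const std::string& cls, const std::string& method) const {
        std::set<std::string> targets;
        std::vector<std::string> work{cls};
        while (!work.empty()) {
            std::string c = work.back();
            work.pop_back();
            targets.insert(lookup(c, method));
            auto it = children.find(c);
            if (it != children.end())
                work.insert(work.end(), it->second.begin(), it->second.end());
        }
        return targets.size() == 1 && !targets.count("");
    }
};

int main() {
    Hierarchy h;
    h.addClass("Shape", "");
    h.addClass("Circle", "Shape");
    h.addClass("Square", "Shape");
    h.defines["area"] = {"Circle", "Square"};  // every subclass overrides area()
    h.defines["name"] = {"Shape"};             // name() is never overridden

    std::cout << h.canDevirtualize("Shape", "area")  << "\n";  // 0: two possible targets
    std::cout << h.canDevirtualize("Circle", "area") << "\n";  // 1: unique target
    std::cout << h.canDevirtualize("Shape", "name")  << "\n";  // 1: unique target
}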
TR # 94-12-03 * "Runtime Support for Dynamic Space-Based Applications on Distributed Memory Multiprocessors" Immaneni Ashok "Dynamic space-based" applications are simulations of objects moving through a closed k-dimensional space subject to mutual forces. There are a wide variety of such applications, differing in the kinds of objects and forces being simulated. These applications exhibit strong data locality patterns, but these patterns change during the computation as objects change position in the simulated space. To achieve good performance when run on distributed memory multiprocessors, two conflicting goals must be addressed: the strong spatial data locality must be exploited, and the computational load must be balanced across the processors. These optimizations can be done statically, by either the programmer or the compiler, or dynamically, by handwritten application code or a runtime system. This dissertation investigates the issues involved in designing a specialized runtime system that provides convenience of programming as well as efficient execution. First, we address the issues in designing a programming interface for dynamic space-based applications. We propose a new programming model that enables the development of parallel codes that involve very few additional lines of code and very few additional concepts beyond those required to construct a sequential version of the application. Second, we address the issues in data partitioning and dynamic load balancing. We propose heuristics that can be effectively used to determine a good partitioning scheme based on the application execution characteristics and the machine characteristics. We propose a novel dynamic load balancing scheme that uses a "non-uniform, adaptive load discretization" method to estimate the load distribution, a "hierarchical" scheme to balance the load, and a "predictive" method to determine how often to rebalance. We show that the schemes that we propose are very effective in reducing the execution overheads due to communication, load imbalance, and load balancing. Third, we study the issues in adapting to processor reallocations performed by the operating system on a multiprogrammed system. We show that it is advantageous for the runtime system to remap the data, unless the processor allocation changes too often or there is very little time left over to finish the job. TR # 94-12-07 * "Wait Free Algorithms for Heaps" Greg Barnes This paper examines algorithms to implement heaps on shared memory multiprocessors. A natural model for these machines is an asynchronous parallel machine, in which the processors are subject to arbitrary delays. On such machines, it is desirable for algorithms to be "wait-free," meaning that each thread makes progress independent of the other threads executing on the machine. We present a wait-free algorithm to implement heaps. The algorithms are similar to the general approach given in an earlier work with optimizations that allow many threads to work on the heap simultaneously, while still guaranteeing a strong serializability property. TR # 95-01-01 * "TRIBORS: A Triplet-Based Object Recognition System" Kari Pulli This paper describes TRIBORS, a triplet-based object recognition system. Object recognition is facilitated by probability models, which are statistical data structures describing the appearance of object features when viewed from a point in a view class.
The features are arranged into triplets, and the appearance of the triplets in the camera image is encoded in a 2D affine-invariant manner. The initial probability model is obtained from a set of synthesized images. Once objects in real camera images have been recognized using the initial probability model, the results of analyzing these images can be used to construct more realistic probability models. The use of these improved models results in faster object recognition and improved classifications. TR # 95-01-02 * "Multiresolution Analysis of Arbitrary Meshes" Matthias Eck, Tony DeRose, Tom Duchamp, Hugues Hoppe, Michael Lounsbery, Werner Stuetzle In computer graphics and geometric modeling, shapes are often represented by triangular meshes. With the advent of laser scanning systems, meshes of extreme complexity are rapidly becoming commonplace. Such meshes are notoriously expensive to store, transmit, and render, and are awkward to edit. Multiresolution analysis offers a simple, unified, and theoretically sound approach to dealing with these problems. A multiresolution representation of a mesh consists of a simple base mesh together with a sequence of local corrections of decreasing magnitude. By omitting small correction terms we can efficiently construct approximations with fewer triangles for compact storage and fast transmission and rendering. When editing meshes it is often desirable to modify the gross shape without affecting surface details. We can accomplish this by modifying terms in the approximation with large spatial extent. Lounsbery et al. have recently developed a technique for creating multiresolution representations for a restricted class of meshes with "subdivision connectivity." Unfortunately, meshes encountered in practice typically do not meet this requirement. In this paper we present a method for overcoming the subdivision connectivity restriction, meaning that completely arbitrary meshes can now be converted to multiresolution form. The method is based on the approximation of an arbitrary initial mesh M by a mesh M^J that has subdivision connectivity and is guaranteed to be within a specified tolerance. Multiresolution approximation of an arbitrary mesh thus proceeds in two steps: in the first step we approximate M by M^J, and in the second step we convert M^J to a multiresolution representation using the methods of Lounsbery et al. The key ingredient of our algorithm is the construction of a parametrization of M over a simple domain. We expect this parametrization to be of use in other contexts, such as texture mapping or the approximation of complex meshes by NURBS patches. TR # 95-01-04 * "Automated Bargaining Agents (Preliminary Results)" Ying Sun and Daniel Weld Rapid commercialization of the Internet and development of the National Information Infrastructure are likely to change the nature of retailing and commerce in profound ways. These changes pose both challenges and opportunities for AI, such as how to design intelligent assistants that help humans cope with increasing complexities of electronic commerce. In this paper, we present the first steps towards creating "automated bargaining agents" -- intelligent assistants that can reason about the relative supply and demand for goods and services and negotiate to reach a good deal. We first review the economic and game theoretic background for such an endeavor. We then present a simple economic model of electronic commerce and describe several bargaining strategies.
Preliminary experimental results suggest that sophisticated, adaptive strategies perform better than simple discounting approaches over a range of economic conditions. TR # 95-01-06 "Fast Multiresolution Image Querying" Charles E. Jacobs, Adam Finkelstein, David H. Salesin We present a method for searching in an image database using a query image that is similar to the intended target. The query image may be a hand-drawn sketch or a (potentially low-quality) scan of the image to be retrieved. Our searching algorithm makes use of multiresolution decompositions of the query and database images. The coefficients of these decompositions are distilled into small "signatures" for each image. We introduce an "image querying metric" that operates on these signatures, and which includes parameters that can be tuned, using a statistical analysis, to accommodate the kinds of image distortions found in different types of image queries. The resulting algorithm is simple, fast, and requires very little storage overhead for the database of signatures. Our experiments with 176 queries in a database of 1024 images show dramatic improvement, in both speed and success rate, over using an L^1 or L^2 norm. TR # 95-01-07 * "Clustering for Glossy Global Illumination" Per Christensen, Dani Lischinski, Eric Stollnitz, David Salesin We present a new clustering algorithm for global illumination in complex environments. The new algorithm extends previous work on clustering for radiosity to allow for nondiffuse (glossy) reflectors. We treat each cluster as a point source with an angular distribution of radiant intensity, and we derive an error bound for transfers between these clusters. The algorithm groups input surfaces into a hierarchy of clusters, and then permits clusters to interact only if the error bound is below an acceptable tolerance. We show that the algorithm is asymptotically more efficient than previous clustering algorithms even when restricted to ideally diffuse environments. Finally, we demonstrate a working implementation of the algorithm on a complex architectural environment. TR # 95-01-08 * "Extending the Applicability and Improving the Performance of Runtime Parallelization" Shun-Tak Leung and John Zahorjan When static analysis of a sequential loop fails to yield reliable information on its dependence structure, a parallelizing compiler is left with three alternatives: it can take the conservative option of emitting code for a sequential execution; it can optimistically emit code to speculatively execute the loop as a DOALL; or it can emit inspector-executor code to determine the actual dependence structure at runtime and to respect it in a parallel execution. The first approach is certain to yield a slow execution. The second approach gives very good results when the loop can in fact be executed as a DOALL, but is of no help otherwise. In this paper we concentrate on the final approach, runtime parallelization through the inspector-executor method. We have two goals in this work. The first is to expand the class of loop to which the approach may be applied by removing restrictions on the loop dependence structures that it can handle. To achieve this goal, we introduce new forms of the inspector and executor that together remove all restrictions on the loop dependence structure. Thus, we show how to parallelize a class of loop that previously would have compelled the compiler to emit sequential code. 
Our second goal is to improve the performance of the inspector-executor approach through specializations applicable when static analysis yields some (weak) information about the array indexing functions used in assignments. We validate our work through a set of examples designed to illustrate the fundamental performance tradeoffs characterizing the specialized implementations, using results taken from executions on 32 processors of a KSR1. TR # 95-02-01 * "Implementing Lightweight Remote Procedure Calls in the Mach 3 Operating System" Virgil Bourassa and John Zahorjan The Mach 3 operating system makes extensive use of remote procedure calls (RPCs) to provide services to client applications. Although the existing Mach 3 RPC facility is highly optimized, the incorporation of a Lightweight Remote Procedure Call (LRPC) facility further reduces this critical cost. This paper describes the implementation of LRPC in the Mach 3 operating system, reviewing the original LRPC concept and implementation in the TAOS operating system, the issues involved with the Mach 3 microkernel operating system, and the resulting design of LRPC_Mach3. Performance measurements indicate that the resulting implementation provides a 31% reduction in latency for a minimal RPC, with even more significant benefits for more complicated RPCs. TR # 95-02-02 * "Sound and Efficient Closed-World Reasoning for Planning" Oren Etzioni, Keith Golden, and Daniel Weld Closed-world reasoning is the process of inferring that a logical sentence is false based on its absence from a knowledge base, or the inability to derive it. Previous work on circumscription, autoepistemic logic, and database theory has explored logical axiomatizations of closed-world reasoning, and investigated computational tractability for propositional theories. Work in planning has traditionally made the closed-world "assumption" but has avoided closed-world reasoning. We take a middle position, and describe a novel method for closed-world reasoning over the first-order theories of action used by planning algorithms such as NONLIN, TWEAK, and UCPOP. We show the method to be both sound and efficient. The method has been incorporated into the XII planner, which supports our Internet softbot (software robot). In our experiments, closed-world inference consistently averaged about 2 milliseconds, while updates averaged approximately 1.2 milliseconds. TR # 95-02-03 * "Lightweight Source Model Extraction" Gail Murphy and David Notkin Reverse engineers depend on the automatic extraction of information from source code. Some useful kinds of information -- source models -- are well-known: call graphs, file dependences, etc. Predicting every kind of potentially useful information is impossible. We have developed a lightweight approach for generating flexible and tolerant source model extractors from lexical specifications. The approach is lightweight in that the specifications are relatively small and easy to write. It is flexible in that there are few constraints on the information in the source that can be extracted (e.g., we can extract from macros, comments, etc.). It is tolerant in that the information can be extracted from source that cannot necessarily be compiled or linked. In essence, we scan for source constructs that contribute to the specified source model while ignoring constructs that do not contribute to that source model.
We have developed tools to support this approach and applied the tools to the extraction of a number of different source models (file dependences, event interactions, call graphs) from multiple source code formats (C, C++, CLOS). We discuss our approach and describe its application to extract source models not available using existing systems; for example, we compute the invoked-by relation over Field tools. We compare and contrast our approach to the conventional approach of generating source models from a program database.
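To give a flavor of the lightweight extraction idea in TR # 95-02-03, the sketch below scans C-like text with purely lexical patterns and emits a calls(caller, callee) source model, even from fragments that would not compile. The actual approach is driven by user-written lexical specifications rather than the two hard-coded patterns shown here; everything in this example (names, patterns, the input fragment) is illustrative.

#include <iostream>
#include <regex>
#include <set>
#include <sstream>
#include <string>
#include <utility>

// Minimal flavor of lightweight, lexical source model extraction: scan C-like
// text for lines that look like function definitions and for call sites, and
// emit a calls(caller, callee) relation, ignoring everything else.  The real
// approach in TR 95-02-03 is driven by user-written lexical specifications;
// the hard-coded patterns below are just an illustration.
std::set<std::pair<std::string, std::string>>
extractCalls(const std::string& source) {
    static const std::regex defRe(R"(^\s*\w[\w\s\*]*\b(\w+)\s*\([^;]*\)\s*\{?\s*$)");
    static const std::regex callRe(R"((\w+)\s*\()");
    std::set<std::pair<std::string, std::string>> calls;
    std::string line, current = "<toplevel>";
    std::istringstream in(source);
    while (std::getline(in, line)) {
        std::smatch m;
        if (std::regex_match(line, m, defRe)) {      // looks like a definition
            current = m[1].str();
            continue;
        }
        auto begin = std::sregex_iterator(line.begin(), line.end(), callRe);
        for (auto it = begin; it != std::sregex_iterator(); ++it)
            calls.insert({current, (*it)[1].str()}); // looks like a call site
    }
    return calls;
}

int main() {
    // Deliberately incomplete fragment: it would not compile or link, but the
    // lexical scan still recovers a useful call graph.
    std::string fragment =
        "static int parse(FILE *f)\n"
        "{\n"
        "    token = next_token(f);\n"
        "    return reduce(stack, token);\n";
    for (const auto& edge : extractCalls(fragment))
        std::cout << edge.first << " -> " << edge.second << "\n";
    // prints: parse -> next_token, parse -> reduce
}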