DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIVERSITY OF WASHINGTON
TECHNICAL REPORT ABSTRACTS - 95-05-02, 95-08-01 -> 96-01-03

Obtaining Electronic Technical Reports:
Technical reports in this list with an asterisk (*) following the TR number are available on-line as of January 1996 via FTP (ftp.cs.washington.edu) or the WWW (http://www.cs.washington.edu/research/tr/techreports.html).

Obtaining Hardcopy Technical Reports:
To obtain a technical report on paper, sent to you via Snail Mail, please contact . If you need additional information or assistance, please contact:

ABSTRACTS

TR # 95-05-02 *
Bradford Chamberlain, Tony DeRose, Dani Lischinski, David Salesin, and John Snyder
"Fast Rendering of Complex Environments Using a Spatial Hierarchy"

This paper presents work in progress on an automated method for accelerating the rendering of complex static scenes. The technique is general: it applies to unstructured scenes consisting of arbitrary geometric primitives, and it does not rely on specific occlusive qualities of the scene to achieve its speed. Our approach is to place a spatially hierarchical decomposition over the scene and to construct a simplified representation of each cell in the hierarchy that approximates the general appearance of its contents. The scene is then rendered by traversing the hierarchy, drawing a cell's approximation whenever it is sufficiently accurate. We apply the method to several scenes, present performance results, and discuss artifacts caused by the approximation. We also address limitations of the current implementation and present our plans for circumventing them.

TR # 95-08-01 *
Gail C. Murphy, David Notkin, and Erica S.-C. Lan
"An Empirical Study of Static Call Graph Extractors"

Informally, a call graph represents calls between entities in a given program. Optimizing compilers often compute call graphs to determine the applicability of many transformations. These call graphs must typically be conservative: a call may be omitted only if it can never occur in any execution of the program. Numerous software engineering tools also extract call graphs, with the expectation that they will help software engineers increase their understanding of a program. The requirements placed on software engineering tools when computing call graphs are typically more relaxed than for compilers. For example, some false negatives---calls that can in fact take place in some execution of the program but are omitted from the call graph---may be acceptable, depending on the understanding task at hand. In this paper we empirically demonstrate a consequence of this spectrum of requirements by comparing the C call graphs extracted from three software systems (mapmaker, mosaic, and gcc) by several extraction tools (cflow, CIA, Field, mkfunctmap, and rigiparse). A quantitative analysis of the call graphs extracted for each system shows considerable variation, a result that is counterintuitive to many experienced software engineers. A qualitative analysis of these results reveals a number of reasons for this variation: differing treatments of macros, function pointers, input formats, etc. We describe and discuss the study, sketch the design space, and discuss the impact of our results on practitioners, tool developers, and researchers.
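[Editor's sketch.] The variation reported above is easy to reproduce, because even a minimal extractor must commit to a treatment of macros and function pointers. The following Python sketch is not one of the tools studied; the regular expressions and the keyword filter are illustrative assumptions. It records a call through a function pointer as a call to the pointer variable and never discovers the real target, one concrete source of the false negatives discussed in the abstract.

    import re
    from collections import defaultdict

    # Heuristic patterns; real extractors differ precisely in choices like
    # these (macro expansion, pointer analysis, accepted input formats).
    FUNC_DEF = re.compile(
        r"^[ \t]*[A-Za-z_][\w\s\*]*?\b([A-Za-z_]\w*)\s*\([^;{)]*\)\s*\{", re.M)
    CALL = re.compile(r"\b([A-Za-z_]\w*)\s*\(")
    NOT_CALLS = {"if", "while", "for", "switch", "return", "sizeof", "void"}

    def naive_call_graph(source):
        """Map each defined function to the set of names it appears to call."""
        graph = defaultdict(set)
        defs = list(FUNC_DEF.finditer(source))
        for i, d in enumerate(defs):
            end = defs[i + 1].start() if i + 1 < len(defs) else len(source)
            for m in CALL.finditer(source[d.end():end]):
                if m.group(1) not in NOT_CALLS:
                    graph[d.group(1)].add(m.group(1))
        return dict(graph)

    c_code = """
    void helper(void) { }
    int main(void) {
        void (*fp)(void) = helper;
        fp();    /* the real target, helper, never appears as a callee */
        return 0;
    }
    """
    print(naive_call_graph(c_code))   # {'main': {'fp'}}

The bogus edge to the name 'fp' and the missing edge main -> helper illustrate, in miniature, why the five tools compared above disagree.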
Thekkath "Implementing Global Memory Management in a Workstation Cluster" Advances in network and processor technology have greatly changed the communication and computational power of local-area workstation clusters. However, operating systems still treat workstation clusters as a collection of loosely-connected processors, where each workstation acts as an autonomous and independent agent. This operating system structure makes it difficult to exploit the characteristics of current clusters, such as low-latency communication, huge primary memories, and high-speed processors, in order to improve the performance of cluster applications. This paper describes the design and implementation of global memory management in a workstation cluster. Our objective is to use a single, unified, but distributed memory management algorithm at the lowest level of the operating system. By managing memory globally at this level, all system- and higher-level software, including VM, file systems, transaction systems, and user applications, can benefit from available cluster memory. We have implemented our algorithm in the OSF/1 operating system running on an ATM-connected cluster of DEC Alpha workstations. Our measurements show that on a suite of memory-intensive programs, our system improves performance by a factor of 1.5 to 3.5. We also show that our algorithm has a performance advantage over others that have been proposed in the past. TR # 95-09-01 * Shun-Tak A. Leung and John Zahorjan "Optimizing Data Locality by Array Restructuring" It is increasingly important that optimizing compilers restructure programs for data locality to obtain high performance on today's powerful architectures. In this paper, we focus on array restructuring, a technique that improves the spatial locality exhibited by array accesses in nested loops. Specifically, we address the following question: Given a set of such accesses, how should the array elements be laid out in memory to match the access pattern and thus maximize locality? Our approach is based on an invertible linear transformation of array index vectors. We present algorithms to choose a suitable transformation, and hence array layout, given the set of array accesses. Our analysis places no restrictions on the loop's nesting structure or dependence pattern. Although we focus on cases where the array indexing expressions are affine functions of loop variables, our techniques can be applied to the non-affine case as well. We have implemented our technique in the SUIF compiler. Experimental results show that array restructuring improves loop execution performance substantially, often with no or little runtime overhead. Moreover, the performance improvement of array restructuring compares favorably with that achieved by loop restructuring. TR # 95-09-02 * Thu Nguyen, Raj Vaswani, and John Zahorjan "Maximizing Speedup Through Self-Tuning Processor Allocation" We address the problem of maximizing the speedup of an individual parallel job through the selection of an appropriate number of processors on which to run it. If a parallel job exhibits speedup that increases monotonically in the number of processors, the solution is clearly to make use of all available processors. However, many applications do not have this characteristic: they reach a point beyond which the use of additional processors degrades performance. For these applications, it is important to choose a processor allocation carefully. 
TR # 95-09-02 *
Thu Nguyen, Raj Vaswani, and John Zahorjan
"Maximizing Speedup Through Self-Tuning Processor Allocation"

We address the problem of maximizing the speedup of an individual parallel job through the selection of an appropriate number of processors on which to run it. If a parallel job exhibits speedup that increases monotonically in the number of processors, the solution is clearly to make use of all available processors. However, many applications do not have this characteristic: they reach a point beyond which the use of additional processors degrades performance. For these applications, it is important to choose a processor allocation carefully.

Our approach to this problem is to provide a runtime system that adjusts the number of processors used by the application based on dynamic measurements of performance gathered during its execution. Our runtime system has a number of advantages over user-specified fixed allocations, currently the most common approach to this problem: (1) we are resilient to changes in an application's speedup behavior due to the input data; and (2) we are able to change the allocation in response to speedup behavior that varies during the lifetime of the application. This work has been done in the context of shared memory multiprocessors, such as the KSR-2 on which we ran our experiments, and for iterative parallel applications, those for which the bulk of the execution is driven by an outer sequential loop. Using both hand-tuned applications from the SPLASH benchmarks and compiler-parallelized applications from the PERFECT Club benchmarks, we show that automatic, dynamic selection of processor allocations can reliably determine allocations that match the best possible static allocation, and has the potential to find allocations that outperform any static allocation.

TR # 95-10-01 *
Thu Nguyen, Raj Vaswani, and John Zahorjan
"Using Runtime Measured Workload Characteristics in Parallel Processor Scheduling"

We consider the use of runtime measured workload characteristics in parallel processor scheduling. Although many researchers have considered the use of application characteristics in this domain, most of this work has assumed that such information is available a priori. In contrast, we propose and experimentally evaluate dynamic processor allocation policies that rely on learning job characteristics at runtime; in particular, we focus on measuring and using job efficiency and speedup. Our work is intended to be a first step towards the eventual development of production schedulers that use runtime measured workload characteristics in making their decisions. The experimental results we present validate the following observations:

- Despite the inherent inaccuracies of runtime measurements and the added overhead of more frequent reallocations, schedulers that use runtime measurements of workload characteristics can significantly outperform schedulers that are oblivious to these characteristics.

- Runtime measurements are sufficient for schedulers to achieve performance surprisingly close to that possible when a priori efficiency and speedup information is available.

- The primary performance loss, relative to the use of a priori information, is due to the transient decisions of the schedulers as they acquire information on the running applications, rather than to measurement and reallocation overheads.

We consider both interactive environments, in which a response-time-directed scheduler is appropriate, and batch environments, in which maximizing useful instruction throughput is the primary goal. Our experiments are performed using prototype implementations running on a 50-node KSR-2 shared memory multiprocessor.
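[Editor's sketch.] The measurement-driven loop underlying this report and the preceding one can be caricatured in a few lines, assuming an iterative application whose outer sequential loop can be timed. The function names and the simulated speedup curve below are stand-ins of ours, not the authors' system.

    import time

    def self_tuning_allocation(run_iteration, max_procs, iters_per_probe=5):
        """Pick a processor count by timing outer-loop iterations.

        run_iteration(p) must execute one outer-loop iteration on p
        processors; here it stands in for the application's parallel step.
        """
        def probe(p):
            start = time.perf_counter()
            for _ in range(iters_per_probe):
                run_iteration(p)
            return (time.perf_counter() - start) / iters_per_probe

        best_p, best_t = 1, probe(1)
        p = 2
        while p <= max_procs:
            t = probe(p)
            if t >= best_t:        # adding processors stopped helping
                break
            best_p, best_t = p, t
            p *= 2                 # geometric probing keeps overhead low
        return best_p

    def fake_iteration(p):
        # Simulated iteration: the parallel part saturates at 8 processors,
        # and each extra processor adds coordination cost.
        time.sleep(0.05 * (1.0 / min(p, 8) + 0.005 * p))

    print(self_tuning_allocation(fake_iteration, max_procs=32))  # typically 8

A production scheduler would, as the bullets above note, also have to amortize the transient cost of these probes across reallocations.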
TR # 95-10-02 *
Craig Anderson
"Improving Performance of Bus-Based Multiprocessors"

Processors have become both cheaper and faster in recent years, to the point where it has become practical to use multiple processors in a single system. An important aspect of such systems is how processors are connected to each other and to main memory. The simplest design uses a single, common bus. The main problem with single-bus multiprocessors is the lack of bus bandwidth, which limits the number of processors that can be connected effectively. This dissertation addresses the problem by examining protocol improvements for single-bus multiprocessors and by investigating the use of clustering to increase the number of processors that can be connected together effectively. We developed a subblock cache coherency protocol that takes advantage of a program's spatial locality while avoiding false sharing. In addition, the use of read broadcasting is investigated. In nearly all cases, the subblock protocol alone performed as well as or better than the best execution of a standard invalidate protocol. Simulation results also show that using read broadcast substantially reduces the number of block re-reads due to invalidation when false sharing is present. Using read broadcast with the subblock protocol further increases the protocol's performance. This dissertation also proposes and evaluates a cache coherence protocol that dynamically chooses either an update- or invalidate-based coherence scheme on a block-by-block basis, according to the application's write patterns. The hybrid adaptive scheme performs as well as or better than the better of the two standard protocols. The second part of this dissertation concerns the use of processor clustering to increase the number of processors that can be connected together effectively. A detailed invalidation-based hierarchical bus protocol is presented. Next, the benefits of moving main memory into the clusters and allocating data there are tested using simulation. The results show that cluster allocation had mixed results, improving performance for some applications while showing no benefit for others. The subblock protocol mentioned above is extended to the hierarchical bus system. Simulation results show that the benefits of using the subblock protocol are nearly as good as those obtained on the single-bus architecture.

TR # 95-11-05 *
Calvin Lin, Lawrence Snyder, Ruth Anderson, Brad Chamberlain, Sung-Eun Choi, George Forman, E. Christopher Lewis, and W. Derrick Weathersby
"ZPL vs. HPF: A Comparison of Performance and Programming Style"

This paper compares the performance of two data parallel array languages, ZPL and HPF, on seven programs from the NAS, NASA, and GENESIS benchmark suites. It also compares the two languages' effect on programming style. The results show that for these benchmarks, ZPL programs generally outperform their HPF counterparts, and that ZPL expresses problems at a higher level of abstraction, yielding programs that are shorter, less error-prone, and easier to maintain. ZPL's better performance seems to come from its more direct expression of parallelism, which appears to allow better compiler analysis.

TR # 95-11-06 *
Sung-Eun Choi
"An Overview of Compiler Techniques for Interprocedural Array Section Analysis"

Dependence analysis of arrays is crucial in the compilation of parallel applications, as an inaccurate summary of array usage may reduce the potential for optimization. Standard scalar techniques are inadequate because they do not accommodate accesses to specific portions of arrays. Array section analysis describes accesses to arrays at a finer granularity than the scalar techniques. More precisely, array section analysis techniques summarize a collection of accesses to a specific array in a procedure. In this paper, we present a summary of existing array section analysis techniques for interprocedural dependence analysis. We identify and compare two classes of such techniques and give suggestions for improving them.
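[Editor's sketch.] One common summary form, used here purely to illustrate the granularity these techniques work at, is the bounded regular section: each array dimension is described by a (lower bound, upper bound, stride) triple, and summaries from different call sites are merged in the conservative direction, which may add elements but never drops one.

    from math import gcd

    def merge_sections(a, b):
        """Conservatively merge two bounded regular sections of one array.

        Each section is a tuple of (lower, upper, stride) triples, one per
        dimension, e.g. ((1, 50, 2), (1, 100, 1)) for A(1:50:2, 1:100).
        Widening the bounds and weakening the stride may over-approximate
        the accessed elements but never omits one -- the safe direction
        for dependence analysis.
        """
        merged = []
        for (lo1, hi1, st1), (lo2, hi2, st2) in zip(a, b):
            # If the strides or their offsets disagree, fall back to a
            # stride that covers both access sequences.
            st = gcd(gcd(st1, st2), abs(lo1 - lo2))
            merged.append((min(lo1, lo2), max(hi1, hi2), st or 1))
        return tuple(merged)

    # A(1:50:2, 1:100) at one call site, A(2:99:2, 1:100) at another:
    print(merge_sections(((1, 50, 2), (1, 100, 1)),
                         ((2, 99, 2), (1, 100, 1))))
    # ((1, 99, 1), (1, 100, 1)) -- odd and even rows together force stride 1

The two classes of techniques the overview compares differ, roughly, in how much of this per-dimension precision they retain across procedure boundaries.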
TR # 95-12-01 *
Jean-Loup Baer and Craig Anderson
"On the Performance of a Bus-Based Multiprocessor Cluster Architecture"

The focus of this paper is the evaluation of a hierarchical cluster architecture in which a cluster consists of a single-bus shared-memory multiprocessor and the interconnect is a tree hierarchy of busses. We first outline a cache coherence protocol for a UMA architecture. We then introduce a variation of the protocol for sector caches, where the coherence and transfer units are not the same. We evaluate, through cycle-by-cycle simulation, the UMA architecture under these two protocols. We simulate six benchmarks with varied memory access patterns and show that the clustering concept is beneficial as long as architectures are balanced. The subblock protocol works well and is more stable than protocols with a fixed block size, which can behave very well on some applications and very poorly on others. In the second part of the paper, we modify the UMA architecture into a NUMA architecture. We discuss several ways to perform memory allocation in this new context. We test some of these allocations on two of the benchmarks and report that improvements are largest when the memory allocation does not induce much extra computation and when architectures are balanced.

TR # 96-01-01
Adam Finkelstein, Charles E. Jacobs, and David H. Salesin
"Multiresolution Video"

We present a new representation for time-varying image data, called multiresolution video. The representation allows for varying -- and arbitrarily high -- spatial and temporal resolutions in different parts of a video sequence. It is based on a sparse, hierarchical encoding of the video data. We show how multiresolution video supports a number of primitive operations: drawing frames at a particular spatial and temporal resolution; and translating, scaling, and compositing multiresolution sequences. These primitives are then used as building blocks to support a variety of applications: video compression; multiresolution playback, including motion-blurred "fast-forward" and "reverse"; constant-speed display; enhanced video scrubbing; and "video clip art" editing and compositing. The multiresolution representation requires little storage overhead, and the algorithms that use it are both simple and efficient.

TR # 96-01-02 *
Michael Salisbury, Corin Anderson, Dani Lischinski, and David Salesin
"A Resolution-Independent Representation for Pen-and-Ink Illustration"

This paper describes a compact, resolution- and scale-independent representation for pen-and-ink illustrations. The proposed representation consists of a low-resolution grey-scale image, augmented by a set of discontinuity segments. We also present a new reconstruction algorithm that magnifies the low-resolution image while keeping the image sharp along the discontinuities. By storing pen-and-ink illustrations in this representation, we can produce high-fidelity illustrations at any scale and resolution by generating an image of the desired size and filling it with pen-and-ink strokes.
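[Editor's sketch.] The role of the discontinuity segments can be pictured in one dimension. Plain linear interpolation smears an intensity edge across the magnified output, while interpolation that refuses to blend across a marked discontinuity keeps the edge one sample wide at any scale. The function below is a much-simplified illustration of that idea by the editor, not the paper's reconstruction algorithm.

    def magnify(samples, factor, discontinuities=frozenset()):
        """Upsample a 1-D grey-scale signal by an integer factor.

        `discontinuities` holds indices i where an edge lies between
        samples[i] and samples[i+1]; interpolation never crosses it.
        """
        out = []
        for i in range(len(samples) - 1):
            a, b = samples[i], samples[i + 1]
            for k in range(factor):
                t = k / factor
                if i in discontinuities:
                    out.append(a if t < 0.5 else b)    # hard step: stay sharp
                else:
                    out.append((1 - t) * a + t * b)    # ordinary blend
        out.append(samples[-1])
        return out

    row = [0.0, 0.0, 1.0, 1.0]                  # an edge between samples 1 and 2
    print(magnify(row, 4))                      # edge smeared over four samples
    print(magnify(row, 4, discontinuities={1})) # edge stays abrupt

The paper's two-dimensional setting adds the harder problems of representing the segments compactly and placing strokes, but the edge-preservation principle is the same.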
TR # 96-01-03
Robert B. Doorenbos, Oren Etzioni, and Daniel S. Weld
"A Scalable Comparison-Shopping Agent for the World-Wide Web"

The Web is less agent-friendly than we might hope. Most information on the Web is presented in loosely structured natural language text with no agent-readable semantics. HTML annotations structure the display of Web pages, but provide virtually no insight into their content. Thus, the designers of intelligent Web agents need to address the following questions: (1) To what extent can an agent understand information published at Web sites? (2) Is the agent's understanding sufficient to provide genuinely useful assistance to users? (3) Is site-specific hand-coding necessary, or can the agent automatically extract information from unfamiliar Web sites? (4) What aspects of the Web facilitate this competence? In this paper we investigate these issues with a case study using ShopBot, a fully implemented, domain-independent comparison-shopping agent. Given the home pages of several on-line stores, ShopBot autonomously learns how to shop at those vendors. After its learning is complete, ShopBot is able to speedily visit over a dozen software stores and CD vendors, extract product information such as availability and price, and summarize the results for the user. Preliminary studies show that ShopBot enables users to both find superior prices and substantially reduce Web shopping time. Remarkably, ShopBot achieves this performance without sophisticated natural language processing, and requires only minimal knowledge about different product domains. Instead, ShopBot relies on a combination of heuristic search, pattern matching, and inductive learning techniques.
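[Editor's sketch.] The flavor of ShopBot's learning phase can be hinted at with a toy version of its pattern-matching step: query a store for a product whose name is already known, find the line that mentions both that name and something price-shaped, and abstract it into a template that then extracts every other product on the page. The page text, the price pattern, and both helper functions below are illustrative assumptions by the editor, not ShopBot's actual machinery.

    import re

    PRICE = re.compile(r"\$\d+(?:\.\d{2})?")

    def learn_template(page_lines, known_product):
        """Turn a line mentioning a known test product and a price into a
        crude template with <PRODUCT> and <PRICE> slots."""
        for line in page_lines:
            if known_product in line and PRICE.search(line):
                return PRICE.sub("<PRICE>", line).replace(known_product,
                                                          "<PRODUCT>")
        return None

    def extract(page_lines, template):
        """Apply a learned template to pull (product, price) pairs."""
        pattern = "".join(
            "(.+?)" if piece == "<PRODUCT>"
            else r"(\$\d+(?:\.\d{2})?)" if piece == "<PRICE>"
            else re.escape(piece)
            for piece in re.split(r"(<PRODUCT>|<PRICE>)", template))
        return [m.groups() for line in page_lines
                for m in [re.fullmatch(pattern, line)] if m]

    page = ["Welcome to HypotheticalSoft!",
            "Item: TurboWrite 2.0 -- $49.95 -- in stock",
            "Item: MegaPaint Pro -- $89.00 -- in stock"]
    template = learn_template(page, "TurboWrite 2.0")
    print(extract(page, template))
    # [('TurboWrite 2.0', '$49.95'), ('MegaPaint Pro', '$89.00')]

Learning from a handful of known test products, rather than hand-coding each site, is what lets an agent of this kind scale to unfamiliar vendors.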