DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIVERSITY OF WASHINGTON
TECHNICAL REPORT ABSTRACTS - 95-05-02, 95-08-01 -> 96-01-03

Obtaining Electronic Technical Reports:
Technical reports in this list with an asterisk (*) following the TR number are available on-line as of January 1996 via FTP (ftp.cs.washington.edu) or the WWW (http://www.cs.washington.edu/research/tr/techreports.html).

Obtaining Hardcopy Technical Reports:
To obtain a technical report on paper, sent to you via Snail Mail, please contact . If you need additional information or assistance, please contact:

ABSTRACTS

TR # 95-05-02 *
Bradford Chamberlain, Tony DeRose, Dani Lischinski, David Salesin, and John Snyder
"Fast Rendering of Complex Environments Using a Spatial Hierarchy"

This paper presents work in progress on an automated method for accelerating the rendering of complex static scenes. The technique is general: it applies to unstructured scenes consisting of arbitrary geometric primitives, and it does not rely on specific occlusive qualities of the scene to achieve its speed. Our approach is to place a spatially hierarchical decomposition over the scene and to construct a simplified representation of each cell in the hierarchy that approximates the general appearance of its contents. The scene is then rendered by traversing the hierarchy, drawing a cell's approximation whenever it is sufficiently accurate. We apply the method to several scenes, present performance results, and discuss artifacts caused by the approximation. We also address limitations of the current implementation and present our plans for circumventing them.

TR # 95-08-01 *
Gail C. Murphy, David Notkin, and Erica S.-C. Lan
"An Empirical Study of Static Call Graph Extractors"

Informally, a call graph represents calls between entities in a given program. Optimizing compilers often compute call graphs to determine the applicability of many transformations. These call graphs must typically be conservative: a call may be omitted only if it can never occur in any execution of the program. Numerous software engineering tools also extract call graphs, with the expectation that they will help software engineers increase their understanding of a program. The requirements placed on software engineering tools when computing call graphs are typically more relaxed than for compilers. For example, some false negatives---calls that can in fact take place in some execution of the program but are omitted from the call graph---may be acceptable, depending on the understanding task at hand. In this paper we empirically demonstrate a consequence of this spectrum of requirements by comparing the C call graphs extracted from three software systems (mapmaker, mosaic, and gcc) by several extraction tools (cflow, CIA, Field, mkfunctmap, and rigiparse). A quantitative analysis of the call graphs extracted for each system shows considerable variation, a result that is counterintuitive to many experienced software engineers. A qualitative analysis of these results reveals a number of reasons for this variation: differing treatments of macros, function pointers, input formats, etc. We describe and discuss the study, sketch the design space, and discuss the impact of our results on practitioners, tool developers, and researchers.
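[Editor's sketch.] The variation reported above is easy to reproduce, because even a minimal extractor must commit to a treatment of macros and function pointers. The following Python sketch is not one of the tools studied; the regular expressions and the keyword filter are illustrative assumptions. It records a call through a function pointer as a call to the pointer variable and never discovers the real target, one concrete source of the false negatives discussed in the abstract.

    import re
    from collections import defaultdict

    # Heuristic patterns; real extractors differ precisely in choices like
    # these (macro expansion, pointer analysis, accepted input formats).
    FUNC_DEF = re.compile(
        r"^[ \t]*[A-Za-z_][\w\s\*]*?\b([A-Za-z_]\w*)\s*\([^;{)]*\)\s*\{", re.M)
    CALL = re.compile(r"\b([A-Za-z_]\w*)\s*\(")
    NOT_CALLS = {"if", "while", "for", "switch", "return", "sizeof", "void"}

    def naive_call_graph(source):
        """Map each defined function to the set of names it appears to call."""
        graph = defaultdict(set)
        defs = list(FUNC_DEF.finditer(source))
        for i, d in enumerate(defs):
            end = defs[i + 1].start() if i + 1 < len(defs) else len(source)
            for m in CALL.finditer(source[d.end():end]):
                if m.group(1) not in NOT_CALLS:
                    graph[d.group(1)].add(m.group(1))
        return dict(graph)

    c_code = """
    void helper(void) { }
    int main(void) {
        void (*fp)(void) = helper;
        fp();    /* the real target, helper, never appears as a callee */
        return 0;
    }
    """
    print(naive_call_graph(c_code))   # {'main': {'fp'}}

The bogus edge to the name 'fp' and the missing edge main -> helper illustrate, in miniature, why the five tools compared above disagree.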
Thekkath "Implementing Global Memory Management in a Workstation Cluster" Advances in network and processor technology have greatly changed the communication and computational power of local-area workstation clusters. However, operating systems still treat workstation clusters as a collection of loosely-connected processors, where each workstation acts as an autonomous and independent agent. This operating system structure makes it difficult to exploit the characteristics of current clusters, such as low-latency communication, huge primary memories, and high-speed processors, in order to improve the performance of cluster applications. This paper describes the design and implementation of global memory management in a workstation cluster. Our objective is to use a single, unified, but distributed memory management algorithm at the lowest level of the operating system. By managing memory globally at this level, all system- and higher-level software, including VM, file systems, transaction systems, and user applications, can benefit from available cluster memory. We have implemented our algorithm in the OSF/1 operating system running on an ATM-connected cluster of DEC Alpha workstations. Our measurements show that on a suite of memory-intensive programs, our system improves performance by a factor of 1.5 to 3.5. We also show that our algorithm has a performance advantage over others that have been proposed in the past. TR # 95-09-01 * Shun-Tak A. Leung and John Zahorjan "Optimizing Data Locality by Array Restructuring" It is increasingly important that optimizing compilers restructure programs for data locality to obtain high performance on today's powerful architectures. In this paper, we focus on array restructuring, a technique that improves the spatial locality exhibited by array accesses in nested loops. Specifically, we address the following question: Given a set of such accesses, how should the array elements be laid out in memory to match the access pattern and thus maximize locality? Our approach is based on an invertible linear transformation of array index vectors. We present algorithms to choose a suitable transformation, and hence array layout, given the set of array accesses. Our analysis places no restrictions on the loop's nesting structure or dependence pattern. Although we focus on cases where the array indexing expressions are affine functions of loop variables, our techniques can be applied to the non-affine case as well. We have implemented our technique in the SUIF compiler. Experimental results show that array restructuring improves loop execution performance substantially, often with no or little runtime overhead. Moreover, the performance improvement of array restructuring compares favorably with that achieved by loop restructuring. TR # 95-09-02 * Thu Nguyen, Raj Vaswani, and John Zahorjan "Maximizing Speedup Through Self-Tuning Processor Allocation" We address the problem of maximizing the speedup of an individual parallel job through the selection of an appropriate number of processors on which to run it. If a parallel job exhibits speedup that increases monotonically in the number of processors, the solution is clearly to make use of all available processors. However, many applications do not have this characteristic: they reach a point beyond which the use of additional processors degrades performance. For these applications, it is important to choose a processor allocation carefully. 
TR # 95-09-02 *
Thu Nguyen, Raj Vaswani, and John Zahorjan
"Maximizing Speedup Through Self-Tuning Processor Allocation"

We address the problem of maximizing the speedup of an individual parallel job through the selection of an appropriate number of processors on which to run it. If a parallel job exhibits speedup that increases monotonically in the number of processors, the solution is clearly to make use of all available processors. However, many applications do not have this characteristic: they reach a point beyond which the use of additional processors degrades performance. For these applications, it is important to choose a processor allocation carefully.

Our approach to this problem is to provide a runtime system that adjusts the number of processors used by the application based on dynamic measurements of performance gathered during its execution. Our runtime system has a number of advantages over user-specified fixed allocations, currently the most common approach to this problem: (1) we are resilient to changes in an application's speedup behavior due to the input data; and (2) we are able to change the allocation in response to speedup behavior that varies during the lifetime of the application. This work has been done in the context of shared memory multiprocessors, such as the KSR-2 on which we ran our experiments, and for iterative parallel applications, those for which the bulk of the execution is driven by an outer sequential loop. Using both hand-tuned applications from the SPLASH benchmarks and compiler-parallelized applications from the PERFECT Club benchmarks, we show that automatic, dynamic selection of processor allocations can reliably determine allocations that match the best possible static allocation, and has the potential to find allocations that outperform any static allocation.

TR # 95-10-01 *
Thu Nguyen, Raj Vaswani, and John Zahorjan
"Using Runtime Measured Workload Characteristics in Parallel Processor Scheduling"

We consider the use of runtime measured workload characteristics in parallel processor scheduling. Although many researchers have considered the use of application characteristics in this domain, most of this work has assumed that such information is available a priori. In contrast, we propose and experimentally evaluate dynamic processor allocation policies that rely on learning job characteristics at runtime; in particular, we focus on measuring and using job efficiency and speedup. Our work is intended to be a first step towards the eventual development of production schedulers that use runtime measured workload characteristics in making their decisions. The experimental results we present validate the following observations:

- Despite the inherent inaccuracies of runtime measurements and the added overhead of more frequent reallocations, schedulers that use runtime measurements of workload characteristics can significantly outperform schedulers that are oblivious to these characteristics.

- Runtime measurements are sufficient for schedulers to achieve performance surprisingly close to that possible when a priori efficiency and speedup information is available.

- The primary performance loss, relative to the use of a priori information, is due to the transient decisions of the schedulers as they acquire information on the running applications, rather than to measurement and reallocation overheads.

We consider both interactive environments, in which a response-time-directed scheduler is appropriate, and batch environments, in which maximizing useful instruction throughput is the primary goal. Our experiments are performed using prototype implementations running on a 50-node KSR-2 shared memory multiprocessor.
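[Editor's sketch.] The measurement-driven loop underlying this report and the preceding one can be caricatured in a few lines, assuming an iterative application whose outer sequential loop can be timed. The function names and the simulated speedup curve below are stand-ins of ours, not the authors' system.

    import time

    def self_tuning_allocation(run_iteration, max_procs, iters_per_probe=5):
        """Pick a processor count by timing outer-loop iterations.

        run_iteration(p) must execute one outer-loop iteration on p
        processors; here it stands in for the application's parallel step.
        """
        def probe(p):
            start = time.perf_counter()
            for _ in range(iters_per_probe):
                run_iteration(p)
            return (time.perf_counter() - start) / iters_per_probe

        best_p, best_t = 1, probe(1)
        p = 2
        while p <= max_procs:
            t = probe(p)
            if t >= best_t:        # adding processors stopped helping
                break
            best_p, best_t = p, t
            p *= 2                 # geometric probing keeps overhead low
        return best_p

    def fake_iteration(p):
        # Simulated iteration: the parallel part saturates at 8 processors,
        # and each extra processor adds coordination cost.
        time.sleep(0.05 * (1.0 / min(p, 8) + 0.005 * p))

    print(self_tuning_allocation(fake_iteration, max_procs=32))  # typically 8

A production scheduler would, as the bullets above note, also have to amortize the transient cost of these probes across reallocations.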
TR # 95-10-02 *
Craig Anderson
"Improving Performance of Bus-Based Multiprocessors"

Processors have become both cheaper and faster in recent years, to the point where it has become practical to use multiple processors in a single system. An important aspect of such systems is how processors are connected to each other and to main memory. The simplest design uses a single, common bus. The main problem with single-bus multiprocessors is the lack of bus bandwidth, which limits the number of processors that can be connected effectively. This dissertation addresses the problem by examining protocol improvements for single-bus multiprocessors and by investigating the use of clustering to increase the number of processors that can be connected together effectively. We developed a subblock cache coherency protocol that takes advantage of a program's spatial locality while avoiding false sharing. In addition, the use of read broadcasting is investigated. In nearly all cases, the subblock protocol alone performed as well as or better than the best execution of a standard invalidate protocol. Simulation results also show that using read broadcast substantially reduces the number of block re-reads due to invalidation when false sharing is present. Using read broadcast with the subblock protocol further increases the protocol's performance. This dissertation also proposes and evaluates a cache coherence protocol that dynamically chooses either an update- or invalidate-based coherence scheme on a block-by-block basis, according to the application's write patterns. The hybrid adaptive scheme performs as well as or better than the better of the two standard protocols. The second part of this dissertation concerns the use of processor clustering to increase the number of processors that can be connected together effectively. A detailed invalidation-based hierarchical bus protocol is presented. Next, the benefits of moving main memory into the clusters and allocating data there are tested using simulation. The results show that cluster allocation had mixed results, improving performance for some applications while showing no benefit for others. The subblock protocol mentioned above is extended to the hierarchical bus system. Simulation results show that the benefits of using the subblock protocol are nearly as good as those obtained on the single-bus architecture.

TR # 95-11-05 *
Calvin Lin, Lawrence Snyder, Ruth Anderson, Brad Chamberlain, Sung-Eun Choi, George Forman, E. Christopher Lewis, and W. Derrick Weathersby
"ZPL vs. HPF: A Comparison of Performance and Programming Style"

This paper compares the performance of two data parallel array languages, ZPL and HPF, on seven programs from the NAS, NASA, and GENESIS benchmark suites. It also compares the two languages' effect on programming style. The results show that for these benchmarks, ZPL programs generally outperform their HPF counterparts, and that ZPL expresses problems at a higher level of abstraction, yielding programs that are shorter, less error-prone, and easier to maintain. ZPL's better performance seems to come from its more direct expression of parallelism, which appears to allow better compiler analysis.

TR # 95-11-06 *
Sung-Eun Choi
"An Overview of Compiler Techniques for Interprocedural Array Section Analysis"

Dependence analysis of arrays is crucial in the compilation of parallel applications, as an inaccurate summary of array usage may reduce the potential for optimization. Standard scalar techniques are inadequate because they do not accommodate accesses to specific portions of arrays. Array section analysis describes accesses to arrays at a finer granularity than the scalar techniques. More precisely, array section analysis techniques summarize a collection of accesses to a specific array in a procedure. In this paper, we present a summary of existing array section analysis techniques for interprocedural dependence analysis. We identify and compare two classes of such techniques and give suggestions for improving them.
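[Editor's sketch.] One common summary form, used here purely to illustrate the granularity these techniques work at, is the bounded regular section: each array dimension is described by a (lower bound, upper bound, stride) triple, and summaries from different call sites are merged in the conservative direction, which may add elements but never drops one.

    from math import gcd

    def merge_sections(a, b):
        """Conservatively merge two bounded regular sections of one array.

        Each section is a tuple of (lower, upper, stride) triples, one per
        dimension, e.g. ((1, 50, 2), (1, 100, 1)) for A(1:50:2, 1:100).
        Widening the bounds and weakening the stride may over-approximate
        the accessed elements but never omits one -- the safe direction
        for dependence analysis.
        """
        merged = []
        for (lo1, hi1, st1), (lo2, hi2, st2) in zip(a, b):
            # If the strides or their offsets disagree, fall back to a
            # stride that covers both access sequences.
            st = gcd(gcd(st1, st2), abs(lo1 - lo2))
            merged.append((min(lo1, lo2), max(hi1, hi2), st or 1))
        return tuple(merged)

    # A(1:50:2, 1:100) at one call site, A(2:99:2, 1:100) at another:
    print(merge_sections(((1, 50, 2), (1, 100, 1)),
                         ((2, 99, 2), (1, 100, 1))))
    # ((1, 99, 1), (1, 100, 1)) -- odd and even rows together force stride 1

The two classes of techniques the overview compares differ, roughly, in how much of this per-dimension precision they retain across procedure boundaries.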
TR # 95-12-01 *
Jean-Loup Baer and Craig Anderson
"On the Performance of a Bus-Based Multiprocessor Cluster Architecture"

The focus of this paper is the evaluation of a hierarchical cluster architecture in which a cluster consists of a single-bus shared-memory multiprocessor and the interconnect is a tree hierarchy of busses. We first outline a cache coherence protocol for a UMA architecture. We then introduce a variation of the protocol for sector caches, where the coherence and transfer units are not the same. We evaluate, through cycle-by-cycle simulation, the UMA architecture under these two protocols. We simulate six benchmarks with varied memory access patterns and show that the clustering concept is beneficial as long as architectures are balanced. The subblock protocol works well and is more stable than protocols with a fixed block size, which can behave very well on some applications and very poorly on others. In the second part of the paper, we modify the UMA architecture into a NUMA architecture. We discuss several ways to perform memory allocation in this new context. We test some of these allocations on two of the benchmarks and report that improvements are largest when the memory allocation does not induce much extra computation and when architectures are balanced.

TR # 96-01-01
Adam Finkelstein, Charles E. Jacobs, and David H. Salesin
"Multiresolution Video"

We present a new representation for time-varying image data, called multiresolution video. The representation allows for varying -- and arbitrarily high -- spatial and temporal resolutions in different parts of a video sequence. It is based on a sparse, hierarchical encoding of the video data. We show how multiresolution video supports a number of primitive operations: drawing frames at a particular spatial and temporal resolution; and translating, scaling, and compositing multiresolution sequences. These primitives are then used as building blocks to support a variety of applications: video compression; multiresolution playback, including motion-blurred "fast-forward" and "reverse"; constant-speed display; enhanced video scrubbing; and "video clip art" editing and compositing. The multiresolution representation requires little storage overhead, and the algorithms that use it are both simple and efficient.

TR # 96-01-02 *
Michael Salisbury, Corin Anderson, Dani Lischinski, and David Salesin
"A Resolution-Independent Representation for Pen-and-Ink Illustration"

This paper describes a compact, resolution- and scale-independent representation for pen-and-ink illustrations. The proposed representation consists of a low-resolution grey-scale image, augmented by a set of discontinuity segments. We also present a new reconstruction algorithm that magnifies the low-resolution image while keeping the image sharp along the discontinuities. By storing pen-and-ink illustrations in this representation, we can produce high-fidelity illustrations at any scale and resolution by generating an image of the desired size and filling it with pen-and-ink strokes.
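[Editor's sketch.] The role of the discontinuity segments can be pictured in one dimension. Plain linear interpolation smears an intensity edge across the magnified output, while interpolation that refuses to blend across a marked discontinuity keeps the edge one sample wide at any scale. The function below is a much-simplified illustration of that idea by the editor, not the paper's reconstruction algorithm.

    def magnify(samples, factor, discontinuities=frozenset()):
        """Upsample a 1-D grey-scale signal by an integer factor.

        `discontinuities` holds indices i where an edge lies between
        samples[i] and samples[i+1]; interpolation never crosses it.
        """
        out = []
        for i in range(len(samples) - 1):
            a, b = samples[i], samples[i + 1]
            for k in range(factor):
                t = k / factor
                if i in discontinuities:
                    out.append(a if t < 0.5 else b)    # hard step: stay sharp
                else:
                    out.append((1 - t) * a + t * b)    # ordinary blend
        out.append(samples[-1])
        return out

    row = [0.0, 0.0, 1.0, 1.0]                  # an edge between samples 1 and 2
    print(magnify(row, 4))                      # edge smeared over four samples
    print(magnify(row, 4, discontinuities={1})) # edge stays abrupt

The paper's two-dimensional setting adds the harder problems of representing the segments compactly and placing strokes, but the edge-preservation principle is the same.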
TR # 96-01-03
Robert B. Doorenbos, Oren Etzioni, and Daniel S. Weld
"A Scalable Comparison-Shopping Agent for the World-Wide Web"

The Web is less agent-friendly than we might hope. Most information on the Web is presented in loosely structured natural language text with no agent-readable semantics. HTML annotations structure the display of Web pages, but provide virtually no insight into their content. Thus, the designers of intelligent Web agents need to address the following questions: (1) To what extent can an agent understand information published at Web sites? (2) Is the agent's understanding sufficient to provide genuinely useful assistance to users? (3) Is site-specific hand-coding necessary, or can the agent automatically extract information from unfamiliar Web sites? (4) What aspects of the Web facilitate this competence? In this paper we investigate these issues with a case study using ShopBot, a fully implemented, domain-independent comparison-shopping agent. Given the home pages of several on-line stores, ShopBot autonomously learns how to shop at those vendors. After its learning is complete, ShopBot is able to speedily visit over a dozen software stores and CD vendors, extract product information such as availability and price, and summarize the results for the user. Preliminary studies show that ShopBot enables users to both find superior prices and substantially reduce Web shopping time. Remarkably, ShopBot achieves this performance without sophisticated natural language processing, and requires only minimal knowledge about different product domains. Instead, ShopBot relies on a combination of heuristic search, pattern matching, and inductive learning techniques.
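[Editor's sketch.] The flavor of ShopBot's learning phase can be hinted at with a toy version of its pattern-matching step: query a store for a product whose name is already known, find the line that mentions both that name and something price-shaped, and abstract it into a template that then extracts every other product on the page. The page text, the price pattern, and both helper functions below are illustrative assumptions by the editor, not ShopBot's actual machinery.

    import re

    PRICE = re.compile(r"\$\d+(?:\.\d{2})?")

    def learn_template(page_lines, known_product):
        """Turn a line mentioning a known test product and a price into a
        crude template with <PRODUCT> and <PRICE> slots."""
        for line in page_lines:
            if known_product in line and PRICE.search(line):
                return PRICE.sub("<PRICE>", line).replace(known_product,
                                                          "<PRODUCT>")
        return None

    def extract(page_lines, template):
        """Apply a learned template to pull (product, price) pairs."""
        pattern = "".join(
            "(.+?)" if piece == "<PRODUCT>"
            else r"(\$\d+(?:\.\d{2})?)" if piece == "<PRICE>"
            else re.escape(piece)
            for piece in re.split(r"(<PRODUCT>|<PRICE>)", template))
        return [m.groups() for line in page_lines
                for m in [re.fullmatch(pattern, line)] if m]

    page = ["Welcome to HypotheticalSoft!",
            "Item: TurboWrite 2.0 -- $49.95 -- in stock",
            "Item: MegaPaint Pro -- $89.00 -- in stock"]
    template = learn_template(page, "TurboWrite 2.0")
    print(extract(page, template))
    # [('TurboWrite 2.0', '$49.95'), ('MegaPaint Pro', '$89.00')]

Learning from a handful of known test products, rather than hand-coding each site, is what lets an agent of this kind scale to unfamiliar vendors.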