DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
                    UNIVERSITY OF WASHINGTON
           TECHNICAL REPORT ABSTRACTS - 94-02-01 -> 94-05-07


Obtaining Electronic Technical Reports:
  Technical reports in this list with an asterisk (*) following the
  TR number are available on-line as of April 1995
  via FTP (ftp.cs.washington.edu), or
  the WWW (http://www.cs.washington.edu/research/tr/techreports.html)

Obtaining Hardcopy Technical Reports:
  To obtain a technical report on paper, sent to you via Snail Mail,
  please contact <tr-request@cs.washington.edu>.

If you need additional information or assistance, please contact:
  <tr-request@cs.washington.edu>

 
                          ABSTRACTS
  
TR # 94-02-01 *
"Restructuring Arrays for Efficient Parallel Loop Execution"
Shun-Tak Leung and John Zahorjan

In a sequential program, data are often structured in a way that is
optimized for a sequential execution.  However, when the program is
parallelized, the data access pattern may change drastically.  If the
structure of the data is not changed accordingly, parallel performance
will suffer.  In this paper, we consider this problem in the context
of runtime loop parallelization which is a general technique to
parallelize loops not amenable to compile-time analysis.  In a
parallel execution of a loop, iteration may be performed in a
very different order than in the sequential execution.  This may
result in undesirable cache effects on distributed shared-memory
multiprocessors, unless the structure of the arrays accessed by these
iterations is changed accordingly.  We discuss what these problems are
and how they arise.  We then describe two data restructuring
techniques to address them: the restructuring of read-write arrays to
reduce inter-processor communication due to false sharing, and the
restructuring of read-only arrays to improve spatial locality.
We also report experiments on a KSR1 to evaluate the effectiveness
of these techniques and the preprocessing and postprocessing overheads
they entail.  The results show that the restructuring techniques can
substantially improve performance of the parallelized loop.  When
restructuring overheads are ignored, we see a doubling of parallel
speedups.  While restructuring overheads can be quite significant,
they can often be amortized across multiple loop executions so that
they do not outweigh the performance benefits.  In our experiments, it
only takes two loop executions to achieve this.


TR # 94-02-02 *
"An Algorithm for Exact Bounds on the Time Separation of Events in 
Concurrent Systems"
Henrik Hulgaard, Steven M. Burns, Tod Amon, Gaetano Borriello 

Determining the time separation of events is a fundamental problem in
the analysis, synthesis, and optimization of concurrent systems.
Applications range from logic optimization of asynchronous digital
circuits to evaluation of execution times of programs for real-time
systems.  We present an efficient algorithm to find exact (tight)
bounds on the separation time of events in an arbitrary process graph
without conditional behavior.  This result is more general than the
methods presented in several previously published papers as it handles
cyclic graphs and yields the tightest possible bounds on event
separations.  The algorithm is based on a functional decomposition
technique that permits the implicit evaluation of an infinitely
unfolded process graph.  Examples are presented that demonstrate the
utility and efficiency of the solution.  The algorithm will form a
basis for exploration of timing-constrained synthesis techniques.


TR # 94-02-03 *
"On Scalable State-Based Specifications for Real-Time Systems"
Alan Shaw

Using our communicating real-time state machine (CRSM) language as a
basis, we propose and develop a methodology for specifying
requirements and designs for large real-time systems.  CRSMs are
distributed state machines with novel and general timing facilities
and CSP-like synchronous communications.  The paper first presents a
particular controller-client (CC) architecture for composing CRSMs
into larger components and then uses the CC organization to define a
number of standard in-the-large paradigms for real-time and other
software. 


TR # 94-02-04 *
"The Case for Chaotic Adaptive Routing"
Kevin Bolding, Melanie L. Fulgham, Lawrence Snyder

Chaotic routers are randomizing, non-minimal adaptive packet routers
designed for use in the communication networks of parallel computers.
Chaotic routing is reviewed along with other contemporary network
routing approaches, including the state-of-the-art oblivious routers.
Each routing approach is evaluated for its effectiveness as a
multicomputer message router.  The results indicate that the Chaos
router is the most effective of known routing methods.


TR # 94-02-05 *
"Identifying Profitable Specialization in Object-Oriented Languages"
Jeffrey Dean, Craig Chambers, and David Grove

The performance of object-oriented languages can be greatly improved
if methods can be specialized for particular classes of arguments.
Such specialization can provide the compiler with enough class
information about the receivers of messages within the specialized
routine to enable these messages to be statically-bound to their
target methods and subsequently inlined.  We present an algorithm for
automatically determining which methods are most profitable to
specialize for which argument classes.  This algorithm improves on
previous automatic techniques by avoiding the twin problems of over-
and underspecialization and by being suitable for specializing
programs that use multi-methods.


TR # 94-03-01 
"Typechecking and Modules for Multi-Methods"
Craig Chambers and Gary T. Leavens

Two major obstacles preventing the wider acceptance of multi-methods
are concerns over the lack of encapsulation and modularity and the
lack of static typechecking in existing multi-method-based languages.
This paper addresses both of these problems.  We present a
polynomial-time static typechecking algorithm that checks conformance,
completeness, and consistency of a group of method implementations
with respect to declared message signatures.  This algorithm improves
on previous algorithms by handling separate type and inheritance
hierarchies, the presence of abstract classes, and graph-based method
look-up semantics.  We prove formally that our algorithm fulfills its
specification.  We also present a module system that enables
independently-developed code to be fully encapsulated and statically
typechecked on a per-module basis.  To guarantee that potential
conflicts between independently-developed modules have been resolved,
a simple well-formedness condition on the molecules comprising a
program is checked at link-time.  The typechecking algorithm and
module system are applicable to a range of multi-method-based
languages, but the paper uses the Cecil language as a concrete example
of how they can be applied.


TR # 94-03-02
"A New Data Structure for Fast Approximate Matching"
Andrew P. Berman

Given a set of objects S and a metric D, we describe how to
represent S as a new data structure, the "triangulation trie".  This
data structure can be used to search through S quickly to find
approximate matches to a given object.  Using the triangle inequality,
the search tree is repeatedly pruned to reduce the number of object
comparisons required.  Much of the work is done within the tree using
integer comparisons.  This method can result in very fast database
searches in applications where object comparisons are traditionally
costly.  

Furthermore, the data structure seems to be applicable to a very wide
variety of object types.  The trie is unusual in its construction in
that objects are partitioned according to their respective distances
from a common set of "key" objects.


TR # 94-03-03 *
"SPIN - An Extensible Microkernel for Application-specific Operating
System Services"
Brian Bershad, Craig Chambers, Susan Eggers, Chris Maeda, Dylan
McNamee, Przemyslaw Pardyak, Stefan Savage, Emin Gun Sirer

Application domains, such as multimedia, databases, and parallel
computing, require operating system services with high performance and
high functionality.  Existing operating systems provide fixed
interfaces and implementations to system services and resources.  This
makes them inappropriate for applications whose resource demands and
usage patterns are poorly matched by the services provided.  The SPIN
operating system enables system services to be defined in an
application-specific fashion through an extensible microkernel.  It
offers fine-grained control over a machine's logical and physical
resources to applications through run-time adaptation of the system to
application requirements.


TR # 94-03-04 *
"Interface Timing Verification with Combined MAX and LINEAR
Constraints"
Elizabeth A. Walkup, Gaetano Borriello

A fundamental timing analysis problem in the verification and
synthesis of interface logic circuitry is the determination of the
possible and allowable time separations, or "skews" between interface
events, given timing constraints and propagation delays between the
events generated by the circuits the interface connects.  These skews
are used to verify timing properties and determine allowable
propagation delays for logic synthesis.  The main contributions of
this report are two-old.  First, this report shows that the
verification problem can be expressed with constraints of the form 
x_i <= Max{x_{j1} + Delta_{j1,i}, ..., x_{jm} + Delta_{jm,i}}, such as
those described in several other domains including the {Max, +}
algebra used in modeling discrete event systems.  Second, this report
presents and proves correct an algorithm that provides tight upper
bounds on the time separation between all pairs x_i, x_j for such
constraint set in less time and with tighter bounds than previous
algorithms. 


TR # 94-03-05 * 
"Measurement and Application of Dynamic Receiver Class Distributions"
Charles D. Garrett, Jeffrey Dean, David Grove, Craig Chambers

Dynamic binding slows down object-oriented programs.  Dynamic dispatch
mechanisms which work well where all receiver classes are equally
likely are too pessimistic because at most call sites one receiver
predominates.  We apply dynamic profile information to determine the
dynamic execution frequency distributions of the classes of receivers
at call sites.  We show that these distributions are heavily skewed
towards the most commonly occurring receiver class across several
different languages.  Moreover, we show that the distributions are
stable across program inputs, from one version of a program to
another, and even to some extent across programs that share library
code.  Finally, we demonstrate that significant run-time performance
improvements for object-oriented programs can be gained by exploiting
the information contained in dynamic receiver class distributions in
a relatively simple optimizing compiler.


TR # 94-03-06 *
"Testing Asynchronous Circuits: A Survey"
Henrik Hulgaard, Steve Burns, Gaetano Borriello

Asynchronous circuit design has been studied for decades, but it has
only recently been feasible to construct large and efficient
asynchronous systems.  This paper surveys different techniques for
checking whether an asynchronous circuit has fabrication defects.  The
inherent differences between asynchronous and synchronous circuits,
primarily that asynchronous circuits do not have a global clock,
necessitate a review of the testing techniques used for synchronous
circuits and a re-evaluation of the trade-offs involved.  New methods
that efficiently utilize the structure of asynchronous circuits are
possible, most notably self-checking asynchronous circuits can be
implemented with little or no circuit overhead. 


94-03-07 *
"Constraints and Object Identity" 
Gus Lopez, Bjorn Freeman-Benson, and Alan Borning

Constraint imperative programming is an integration of declarative
constraints and imperative object-oriented programming. The primary
goal of this integration is to use constraints to express relations
among objects explicitly--relations that were implicit in the code in
previous languages. However, one of the fundamental concepts of
object-oriented programming, object identity, can result in implicit
relations, even when explicit identity constraints are supported. We
analyze the problem and propose a solution--identity
constraints--which we have implemented in our Kaleidoscope'93
language. This solution is understandable, efficiently implementable,
and compatible with the  Kaleidoscope constraint model.


TR # 94-04-01 *
"Pin Assignment for Multi-FPGA Systems"
Scott Hauck and Gaetano Borriello

There is currently great interest in using systems of FPGAs for logic
emulators, custom computing devices, and software accelerators.  An
important step in making these technologies more generally useful is
to develop completely automatic mapping tools from high-level
specification to FPGA programming files.  In this paper we examine one
step in this automatic mapping process, the selection of physical FPGA
pins to use for routing inter-FPGA signals.  We present an algorithm
that greatly increases mapping speed while also improving mapping
quality.


TR # 94-04-02 *
"A Group Structuring Mechanism for a Distributed Object-oriented
Language" 
Przemyslaw Pardyak and Brian N. Bershad

This paper describes a structuring mechanism for grouping objects in a
distributed object-oriented language.  A group structuring mechanism
provides a single flexible method for managing distributed
applications that involve complicated communication protocols and
sophisticated structure.  We have added such a mechanism to the
Emerald distributed object-oriented language and its runtime system.
Our group structuring mechanism fits entirely within the context of
object-oriented programming, so similar mechanisms could be added to
other distributed object-oriented languages.


TR # 94-05-01  
"An Architecture-Adaptive, Performance-Driven Router for FPGAs"
Larry McMurchie, Carl Ebeling, and Gaetano Borriello

We present an architecture-adaptive, performance-driven router for
FPGAs.  In this router, delay minimization and congestion elimination
are balanced on an iteration-by-iteration basis.  The signal ordering
dependency that is characteristic of most obstacle avoidance routers
is reduced by allowing signals to share nodes and adjusting the cost
of those nodes in a semi-equilibrium fashion.  Because it requires
only a directed graph to describe the architecture of routing
resources, the router adapts readily to a wide variety of FPGAs
architectures, such as Triptych, Xilinx 3000 and mesh connected arrays
of FPGAs.  The results of routing ISCAS benchmarks on the Triptych
FPGA architecture show an average increase of only 4.5% in critical
path delay from the optimum delay for the placement.  Routes of ISCAS
benchmarks on the Xilinx 3000 architecture show a greater completion
rate than commercial tools as well as 11% shorter critical paths.


TR # 94-05-02 *
"Design and Evaluation of a Subblock Cache Coherence Protocol for
Bus-Based Multiprocessors"
Craig Anderson and Jean-Loup Baer

Parallel applications exhibit a wide variety of memory reference
patterns.  Designing a memory architecture that serves all
applications well is not easy.  However, because tolerating or
reducing memory latency is a priority in effective parallel
processing, it is important to explore new techniques to reduce memory
traffic.

In this paper, we describe a snoopy cache coherence protocol that uses
a large sized transfer block (to take advantage of spatial locality)
while using a small coherence block in order to avoid false sharing.
To further illustrate the protocol, we present an example of its
workings.  We then present the results of simulating our protocol on 5
applications that exhibit a variety of reference patterns.  We find
that our protocol effectively takes advantage of spatial locality
while avoiding the increase in false sharing that often occurs when
using large line sizes.


TR # 94-05-03  
"Self-Tuned Clocks and Crystal Clocks"
Ted Kehl, Steve Burns, Chris Fisher

Self-tuning measures the maximum clocking rate of a circuit and then
adjusts the clock to run at nearly this rate.  Self-tuning can
dramatically improve performance, for example, experiments showed that
self-tuned DRAMs can operate at about half the read access time of
those with conventional control.  In this paper we present experiments
showing why a self-tuned clock -- an asynchronous stop/start clock --
is preferable to a crystal clock in that the self-tuned clock can
operate closer to the maximum clocking speed of a circuit than can a
a crystal while guaranteeing glitch-free synchronization.  The first
part of this paper describes the self-tuned clock and its glitch-free
synchronization.  The second part describes experiments supporting our
contention that the self-tuned clock is better able to operate close
to the maximum speed of the circuitry than is a crystal oscillator.
The third part shows that this new clock is more efficient than a
conventional synchronizer.  The fourth part concerns arbitration for
this clock from many sources.


TR # 94-05-04  
"Using Assertions for Validating, Verifying and Monitoring Real-Time
Systems"
Sitaram C. V. Raju

Real-time systems are different from traditional computer systems
because they have timing constraints that must always be met.
Designing correct real-time systems is not an easy task because it is
simply not possible to exhaustively test any large real-time system.
In this dissertation we attempt to address this problem by using
assertions for validating, verifying and monitoring real-time systems.

We first present a prototyping environment for an executable
specification scheme called Communicating Real-Time State Machines
(CRSMs). The system behavior of CRSMs is
characterized by a time-stamped trace of communication events.  We
present a novel assertion language for specifying safety and timing
assertions on the trace of communication events. The prototyping
environment consists of a graphical editor for describing CRSMs, a
CRSM simulator to observe the execution of the prototype, and an
assertion checker for monitoring assertions during simulation.

We then describe an automatic verification technique for real-time
systems modeled by an expressive subclass of CRSMs. The proposed
approach is to represent the behavior of the system by a finite, timed
reachability graph. We provide a decision procedure for verifying
timing and safety properties (specified in our assertion language)
of the reachability graph. We describe an implementation of
the verification technique and demonstrate its usefulness on example
real-time specifications.

Even if a real-time system has been formally specified, verified and
then implemented in software it is important that the implementation
check for constraint violations states at run-time. The reason is that
unexpected conditions at run-time must not cause the real-time system
to fail. We describe a run-time environment for monitoring timing
constraints in distributed real-time systems.  Timing constraints are
specified in a subset of our assertion language. Constraint violations
are detected at the earliest possible time by deriving and checking
intermediate constraints from the user-specified constraints. 

The major contributions of this dissertation are: a method for
checking CRSM simulations, an automatic verification technique for
CRSMs, and the development of a run-time monitor for real-time
applications. Future work includes the development of a scalable
verification method and experimenting with large real-time
applications. 


TR # 94-05-05  
"A Performance Evaluation of Lock-Free Synchronization Protocols"
Anthony LaMarca

In this paper, we investigate the practical performance of lock-free
techniques that provide synchronization on shared-memory
multiprocessors. Our goal is to provide a technique to allow designers
of new protocols to quickly determine an algorithm's performance
characteristics.  We develop a simple analytical performance model
based on the architectural observations that memory accesses are
expensive, synchronization instructions are more expensive, and that
optimistic synchronization policies result in wasted communication
bandwidth which can slow the system as a whole.  Using our model, we
evaluate the performance of five existing lock-free synchronization
protocols.  We validate our analysis by comparing our results with
simulations of a parallel machine.  Given this analysis, we identify
those protocols which show promise of good performance in practice.
In addition, we note that no existing protocols provide insensitivity
to common delays while still offering performance equivalent to locks.
Accordingly, we introduce a protocol, based on a combination of
existing lock-free techniques which satisfies these criteria.


TR # 94-05-06
"The Design of an Interactive Visualization Tool for Exploring Image
Convolution"
Alex Rothenberg

Much of traditional mathematics education has revolved around the
teaching of formal terminology leaving students unaware of its context
or relevance.  This paper argues that visualization can be used as a
technique to expose the concepts behind the terminology.  The paper
focuses on how visualization can be applied to image convolution, then
describes "the convolver", a tool applying these ideas implemented
under Microsoft Windows.  Using high level graphical interfaces, it is
possible for students to directly experience the concepts bypassing
much of the formal terminology.  Finally, possible uses of the
convolver are presented within both a university level image
processing course and a K-12 mathematics curriculum.


TR # 94-05-07 *
"A Self-Accelerating Packet Service Discipline for Low-Delay Service
to Bursty Flows"
Ricardo Pincheira

In this paper, we consider the transmission of compressed video over
networks.  We develop a packet service discipline called Axel that
provides very low service delays to bursty flows while providing
guaranteed throughput to all flows in the system.  In contrast to
existing policies capable of providing bandwidth guarantees, Axel can
provide low delay service to bursty flows without the need for
resource over-reservation.  We study the behavior of Axel through
analysis and simulation, and conclude that Axel meets its stated
goals.  The policy is targeted at a network environment composed of a
mixture of bursty, delay sensitive traffic and smoother, delay
insensitive traffic.  We expect such an environment to arise from a
mixture of interactive, compressed video streams (eg.,
tele-conferencing) and non-interactive compressed video streams (eg.,
video-on-demand).