Fall Quarter 1997, Seminar Series
The speakers and topics of seminars that will be held during fall
are available here.
The remaining schedule for the quarter, shall be shortly posted on the
Web.
Parallel Algorithms for Power Estimation
Victor J. Kim
Advisee of Dr. Prithviraj Banerjee,
Graduate student,
Northwestern University,
Illinois, USA.
Abstract
The need for efficient and reliable power estimation tools is evident
in today's low-power, high-performance driven VLSI design process.
Currently, several techniques exist for estimating power of
combinational and sequential circuits using lengthy simulation, Monte
Carlo simulation, and probabilistic methods. Simulation-based
methods at the logic level can be highly accurate but often require
long runtimes. This study presents parallelization schemes for these
techniques in the context of distributed-memory multiprocessing
systems. In pattern-partitioning schemes, the workload of the driving
input patterns is distributed across the processors. In circuit-
partitioning schemes, the circuit itself is distributed across the
processors and simulation is performed in a pipelined fashion. Dynamic
load balancing heuristic is presented for re-distributing the circuit
during runtime. With either partitioning scheme, the quality of
the serial-run result is maintained while achieving good speed-ups.
Death and Taxes, Nets and Caches:
Facing Inevitabilities in Parallel PDE Simulation
Dr. David E. Keyes
Old Dominion University &
ICASE, NASA Langley Research Center,
Virginia, USA.
Abstract
Demands for massive memory and high speed typically accompany one
another in scientific and engineering computations, linking space to
time in algorithm design. Some degree of programmer control must be
exerted over data layout in coding for scalable distributed memory
machines (even when the memory is accessible through the model of a
global shared address space). Fortunately, the laws of nature often
cooperate with a basic scaling law of computer architecture: the
magnitude of interaction between two degrees of freedom in a physical
system decays with their spatial separation; therefore, the frequency
and volume of data exchange between different points in the
computational domain can be allowed to decay with distance in a
trade-off involving memory access overhead and the precision required
in a final result (or the rate of convergence required from a
preconditioner). For model problems, this trade-off has been
formalized in convergence theorems. We have been exploring it
primarily experimentally, applying domain decomposition
preconditioners to problems in computational aerodynamics, and
"defying" Amdahl's Law --- up to a point --- through the benefits of
cache locality.
Modern iterative domain decomposition (or ``substructuring'') methods
rely predominantly upon local information. Nonlocal information is
accessed hierarchically, through exchanges requiring only a small
volume of data relative to the scale of the discretization. Hence,
domain decomposition is a natural algorithmic bridge between
application and architecture for problems whose convergence is
dominated by ellipticity, such as the full potential operator, the
Stokes operator, or multicomponent systems including pressure. This
seminar begins with a motivation for parallel implicit algorithms,
briefly summarizes some domain-decomposition (additive Schwarz)
convergence theorems for elliptic and hyperbolic scalar problems, and
illustrates the tuning of Newton-Krylov-Schwarz algorithmic
parameters (subdomain granularity, subdomain solution accuracy,
subdomain overlap width, and coarse-grid density) on a variety
of CFD formulations, ideal and industrial.
Biographical sketch
With an undergraduate concentration in Mechanical Engineering
(Princeton, 1978), graduate degrees in Applied Mathematics (Harvard,
1979/84), and post-doctoral and faculty positions in Computer Science
and Mechanical Engineering at Yale (1984-93), David Keyes joined the
Computer Science Department at Old Dominion University in 1993. He is
also an Associate Research Fellow at the Institute for Computer
Applications in Science and Engineering (ICASE) at the nearby NASA
Langley Research Center, where he has consulted for the past ten years.
He is the founding director of a new Virginia/ICASE/LaRC Program in
High Performance Computing and Communication, a four-campus doctoral
fellowship program centered at NASA Langley. His research, touching a
variety of CFD applications, has been centered on the development and
parallel implementation of domain decomposition iterative methods for
systems of PDEs.
Home page: http://www.cs.odu.edu/~keyes/keyes.html
Parallel Complexity of Coprimality and related Arithmetic Problems
Dr. Bruce Litow
Department of Computer Science,
James Cook University,
Townsville, Australia.
Abstract
After the basic arithmetic operations, arguably the
most important arithmetic function is the greatest common
divisor (GCD). GCD is related to other problems:
1) Coprimality. Are two integers coprime?
2) Modular Inversion. If A and B are coprime, then find integers
X and Y such that AX - BY = 1.
3) Extended GCD. Find integers X and Y such that AX - BY =
GCD(A,B).
The Euclidean algorithm actually solves the extended GCD in
sequential (?) time. However, the situation for parallel time
complexity is far from satisfactory. In the first part of the
talk we will survey what is currently known about the parallel
time complexity of these problems. In the second part, we will
sketch an NC algorithm for coprimality. The algorithm uses Neff's
result on NC approximation of the zeros of a rational polynomial.
Biographical sketch
Dr. Bruce Litow is currently a senior lecturer in the computer science
department of James Cook University in Townsville, Australia.
Previously he was an assistant professor in the EECS department
of the University of Wisconsin - Milwaukee. His principle interests
are parallel and sequential complexity theory and formal language
theory. Until January 31 he will be visiting the CS department of the
University of Central Florida in Orlando.
Distributed Heterogenous Databases
Dr. Lawrence J. Henschen
Department of Electrical and Computer Engineering,
Northwestern University,
Illinois, USA.
Abstract
In addition to studying the details of how distributed systems
themselves work, it is useful to be aware of various applications
that run on distributed computing systems and the issues, both
systems and application oriented, that must be addressed to make
these applications successful.
We describe one such application, namely distributed heterogenous
databases. The focus of the presentation will be on heterogeneity,
the different types of heterogeneity and the approaches
that have been proposed to address these various kinds of
heterogeneity. The presentation does not address system-level
aspects. We differentiate syntactic and semantic heterogeneity,
describe some approaches taken by others to handle syntactic
heterogeneity and then describe our own research into
methods for solving semantic heterogeneity.
This is a joint work with C. Fernandes and T. Neild.
Biographical sketch
Dr. Henschen received the BA, MA and PhD degrees from the
University of Illinois at Urbana-Champaign in 1966, 1968 and 1971
respectively. His main areas of research center on automated
reasoning, particularly resolution and related inference
mechanisms, and applications to database technology. His early
work on Horn theories had significant application later to logic
programming and deductive databases. He has participated in the
development of many experimental reasoning systems. He was
chairman of the first international meeting on automated theorem
proving, which led to the CADE series of conferences. He and his
students were the first to offer feasible solutions to the
recursion and integrity problems in deductive databases. Dr. Henschen
has produced almost 60 PhD students since 1971. He serves on the
editorial board for two major journals and on the program committees
of the major conferences for automated reasoning. He has published
more than 100 journal and conference papers.
Evaluation of Compiler and run-time Library Approaches for Supporting
Parallel Regular Applications
Dhruva R. Chakrabarti
Advisee of Dr. Prithviraj Banerjee,
Graduate student,
Northwestern University,
Illinois, USA.
Abstract
Important applications including those in computational chemistry,
computational fluid dynamics, structural analysis and sparse matrix
applications are frequently modeled using iterations on unstructured
grids. This lack of structure makes it impossible for compilers to
generate efficient code for multi-processors with distributed address
space since the communication pattern can be determined only at
run-time. This problem has been traditionally handled by using
run-time libraries which generate optimized communication (equivalent
to those generated by compilers for regular accesses) as soon as the
communication pattern is known. However, in each of these
applications, there is usually a combination of both regular and
irregular accesses. While current state-of-the-art support for such
applications handles the irregular accesses reasonably well, the
efficacy of the optimizations at run-time for the regular accesses is
yet to be proven. Our study aims to find out a better approach to
handle the above applications in a unified compiler and run-time
framework.
The seminar begins with a motivation behind optimization for
irregularly structured code. The salient features of PILAR are
presented. The talk then focuses upon the need to efficiently
optimize the regular accesses within irregular applications. We
evaluate the performance of two run-time libraries, namely PILAR and
PILAR using an enumerated representation (equivalent to the CHAOS
approach), and a commercial HPF compiler for purely regular
applications. The aim is to determine the extent of overhead that is
usually associated with run-time resolution of regular accesses when
compared to a commercial compiler. We show that using a particular
representation of regular accesses, the performance of regular code
using run-time libraries can come close to the performance of code
generated by a compiler. The results indicate that a run-time library
incorporating multiple representations for efficient resolution of
both regular and irregular accesses will perform better than
conventional techniques. The results hint at possible amortization of
some of the run-time overhead by moving some of the run-time
complexities into a compiler which generates code annotated with calls
to a run-time library for an application containing both regular and
irregular accesses.
Tempest: A Substrate for Portable Parallel Programs
Dr. Mark D. Hill
Computer Sciences Department,
University of Wisconsin,
1210 West Dayton Street, Madison,
Wisconsin, USA.
Abstract
Are parallel programs portable? Today the answer is "no". Massively
parallel processors (MPPs) are programmed with explicit message
passing, networks of workstations (NOWs) with sockets, and symmetric
multiprocessors (SMPs) with shared memory. To be portable, parallel
applications must run across this range of machines without source-level
changes. To make this possible, we have developed the Tempest
interface, which provides mechanisms to support shared-memory,
message-passing, and hybrid communications. With these mechanisms, a
low-cost workstation cluster can provide the same communications
abstractions (e.g., shared memory) as high-performance parallel
machines.
The Tempest interface consists of low-level communication and
memory-system mechanisms. User-level software implements specific
policies, such as shared-memory protocols. Programmers and compilers
can customize a policy to fit an application's semantics and sharing
patterns. Experiments show that in some cases custom cache-coherence
policies can produce an order-of-magnitude performance improvement,
without restructuring a program's source code.
We have studied several Tempest implementations. The first, Typhoon,
offers high performance with first-class hardware support. This
proposed hardware implements Tempest using a fully-programmable,
user-level processor in the network interface. The second system,
Blizzard, demonstrates Tempest's portability by implementing the same
interface on unmodified stock hardware: both a Thinking Machines CM-5
and the Wisconsin COW (a Cluster Of Workstations).
Finally, our third system, Typhoon-Zero, serves as both a mid-range
implementation and a prototype of the more complex Typhoon hardware.
Typhoon-Zero adds a simple hardware module to the Wisconsin COW nodes
to accelerate Tempest's fine-grain access control mechanism.
This work is part of the Wisconsin Wind Tunnel project, co-led by Mark
Hill, James Larus, and David Wood. For more information, see URL
http://www.cs.wisc.edu/~wwt.
Biographical sketch
Mark D. Hill (http://www.cs.wisc.edu/~markhill) is Associate Professor
and Romnes Fellow in both the Computer Sciences Department and the
Electrical and Computer Engineering Department at the University of
Wisconsin-Madison. He earned a Ph.D. from Berkeley in 1987, won an
NSF Presidential Young Investigator award in 1989, and is a Director
of ACM SIGARCH. His work targets the memory systems of shared-memory
multiprocessors and high-performance uniprocessors. He currently
co-directs the Wisconsin Wind Tunnel parallel computing project with
Profs. James Larus and David Wood.
I/O Issues in Parallel Object-Relational Databases
Sachin S. More
Advisee of Dr. Alok Choudhary,
Graduate student,
Northwestern University,
Illinois, USA.
Abstract
The object-relational database model provides the unique facility to
add functionality to the database engine to tailor it for any
application domain. The geo-spatial application domain is
an interesting class of applications which require significant compute
and storage resources. Recent research has shown that parallel
object-relational databases can be used profitably to provide the
necessary computing resources for geo-spatial applications.
This talk focuses on the I/O related issues in parallel
object-relational databases.
Geo-spatial datasets typically contain raster images, each of which
is several megabytes in size. Queries process these images by
applying operators and functions on them. Hence there is a need for
efficient data storage and retrieval strategies. This talk will
discuss some of these issues. Initial experimental results will also
be presented.
Scalable, High-Performance Parallel Computing for Combinatorial
Optimization Problems
Dr. Shantanu Dutt
Department of Electrical Engineering and Computer Science,
University of Illinois at Chicago,
Illinois, USA.
Abstract
In this talk, I will present new strategies for scalable
high-performance parallel best-first search (BFS) algorithms for
solving combinatorial optimization problems. BFS is a widely used
strategy for solving many optimization problems. In parallel BFS,
inefficiencies such as processor starvation and search of non-essential
spaces (those not explored by the sequential algorithm) grow with the
number of processors used (P), thus restricting its scalability. To
alleviate this effect, we propose an efficient parallel startup phase
and a novel dynamic load balancing method called the quality equalizing
(QE) strategy. The QE strategy possesses certain unique quantitative
and qualitative load balancing properties that allow it to
significantly reduce starvation and non-essential work, and also
``anticipate'' these events before they occur. This yields a highly
scalable parallel BFS algorithm with an almost-linear speedup. The
startup and load balancing methods were employed in parallel BFS
algorithms to solve Traveling Salesman Problems (TSPs) and Mixed
Integer Programming (MIP) on the nCUBE2 and IBM SP-2 multicomputers.
The QE strategy yields speedup improvements of about 20-185% and
15-120% over three well known load balancing methods---the round-robin
(RR), the random communication (RC) and the neighborhood averaging (NA)
strategies. The average speedup observed on the nCUBE2 with 1024
processors is about 985, representing a very high efficiency of 0.96.
We also analyze and empirically evaluate the scalability of parallel
BFS algorithms in terms of the isoefficiency metric. Our analysis
obtains the following tight bounds: (1) a \Theta(P log P) lower bound
for any parallel BFS algorithm, and (2) a general expression for the
upper bound on any topology---for the hypercube and 2-D mesh
architectures the upper bounds on the isoefficiency function are found
to be \Theta(P log^{2} P) and \Theta(P\sqrt{P}), respectively.
Experimental results validate our analysis, and show that parallel BFS
using the QE load balancing strategy has better scalability than other
techniques.
In on-going work, we are investigating an adaptive load balancing
framework QE* that can automatically adapt its communication frequency
and locality of direct work transfers to application and parallel
platform characteristics in order to obtain high performance across a
wide range of applications and parallel machines (MPPs, SMP clusters
and networks of workstations).
Biographical sketch
Shantanu Dutt received the B.E. degree in electronics and communication
engineering from the M.S. University of Baroda, India, in 1983, the
M.Tech. degree in computer engineering from the Indian Institute of
Technology, Kharagpur, India in 1984, and the Ph.D. degree in computer
science and engineering from the University of Michigan, Ann Arbor, in
1990. He is currently an associate professor at the Department of
Electrical Engineering and Computer Science, University of
Illinois at Chicago.
His current technical interests include CAD for VLSI/ULSI circuits,
parallel and distributed computing, fault-tolerant computing and
computer architecture.
Compiler Algorithms for Optimizing Locality and Parallelism on
Shared Memory and Distributed Memory machines.
Mahmut T. Kandemir
Advisee of Dr. Alok Choudhary,
Graduate student,
Northwestern University,
Illinois, USA.
Abstract
Distributed-memory message-passing machines deliver scalable
performance but are difficult to program. Shared-memory machines, on
the other hand, are easier to program but obtaining a scalable
performance with large number of processors is difficult. Recently,
some scalable architectures based on logically-shared
physically-distributed memory have been designed and
implemented. While some of the performance issues like parallelism and
locality are common to the different parallel architectures, issues
such as data decomposition are unique to specific types of
architectures. One of the most important challenges compiler writers
face is to design compilation techniques that can work on a variety of
architectures. In this talk, I propose an algorithm that can be
employed by optimizing compilers for different types of parallel
architectures. The optimization algorithm does the following: (1)
transforms loop nests such that, where possible, the outermost loop
can be run in parallel across processors; (2) decomposes each array
across processors; (3) optimizes interprocessor communication by
vectorizing it whenever possible; and (4) optimizes locality (cache
performance) by assigning appropriate storage layout for each
array. Depending on the underlying hardware system, some or all of
these steps can be applied in a unified framework.
This is a joint work with Dr. Choudhary and Dr. Ramanujam.