Fall Quarter 1997, Seminar Series

The speakers and topics of seminars that will be held during fall are available here.

The remaining schedule for the quarter, shall be shortly posted on the Web.

Parallel Algorithms for Power Estimation

Victor J. Kim
Advisee of Dr. Prithviraj Banerjee,
Graduate student,
Northwestern University,
Illinois, USA.

Abstract

The need for efficient and reliable power estimation tools is evident in today's low-power, high-performance driven VLSI design process. Currently, several techniques exist for estimating power of combinational and sequential circuits using lengthy simulation, Monte Carlo simulation, and probabilistic methods. Simulation-based methods at the logic level can be highly accurate but often require long runtimes. This study presents parallelization schemes for these techniques in the context of distributed-memory multiprocessing systems. In pattern-partitioning schemes, the workload of the driving input patterns is distributed across the processors. In circuit- partitioning schemes, the circuit itself is distributed across the processors and simulation is performed in a pipelined fashion. Dynamic load balancing heuristic is presented for re-distributing the circuit during runtime. With either partitioning scheme, the quality of the serial-run result is maintained while achieving good speed-ups.

Death and Taxes, Nets and Caches:
Facing Inevitabilities in Parallel PDE Simulation

Dr. David E. Keyes
Old Dominion University &
ICASE, NASA Langley Research Center,
Virginia, USA.

Abstract

Demands for massive memory and high speed typically accompany one another in scientific and engineering computations, linking space to time in algorithm design. Some degree of programmer control must be exerted over data layout in coding for scalable distributed memory machines (even when the memory is accessible through the model of a global shared address space). Fortunately, the laws of nature often cooperate with a basic scaling law of computer architecture: the magnitude of interaction between two degrees of freedom in a physical system decays with their spatial separation; therefore, the frequency and volume of data exchange between different points in the computational domain can be allowed to decay with distance in a trade-off involving memory access overhead and the precision required in a final result (or the rate of convergence required from a preconditioner). For model problems, this trade-off has been formalized in convergence theorems. We have been exploring it primarily experimentally, applying domain decomposition preconditioners to problems in computational aerodynamics, and "defying" Amdahl's Law --- up to a point --- through the benefits of cache locality.

Modern iterative domain decomposition (or ``substructuring'') methods rely predominantly upon local information. Nonlocal information is accessed hierarchically, through exchanges requiring only a small volume of data relative to the scale of the discretization. Hence, domain decomposition is a natural algorithmic bridge between application and architecture for problems whose convergence is dominated by ellipticity, such as the full potential operator, the Stokes operator, or multicomponent systems including pressure. This seminar begins with a motivation for parallel implicit algorithms, briefly summarizes some domain-decomposition (additive Schwarz) convergence theorems for elliptic and hyperbolic scalar problems, and illustrates the tuning of Newton-Krylov-Schwarz algorithmic parameters (subdomain granularity, subdomain solution accuracy, subdomain overlap width, and coarse-grid density) on a variety of CFD formulations, ideal and industrial.

Biographical sketch

With an undergraduate concentration in Mechanical Engineering (Princeton, 1978), graduate degrees in Applied Mathematics (Harvard, 1979/84), and post-doctoral and faculty positions in Computer Science and Mechanical Engineering at Yale (1984-93), David Keyes joined the Computer Science Department at Old Dominion University in 1993. He is also an Associate Research Fellow at the Institute for Computer Applications in Science and Engineering (ICASE) at the nearby NASA Langley Research Center, where he has consulted for the past ten years. He is the founding director of a new Virginia/ICASE/LaRC Program in High Performance Computing and Communication, a four-campus doctoral fellowship program centered at NASA Langley. His research, touching a variety of CFD applications, has been centered on the development and parallel implementation of domain decomposition iterative methods for systems of PDEs.

Home page: http://www.cs.odu.edu/~keyes/keyes.html

Parallel Complexity of Coprimality and related Arithmetic Problems

Dr. Bruce Litow
Department of Computer Science,
James Cook University,
Townsville, Australia.

Abstract

After the basic arithmetic operations, arguably the most important arithmetic function is the greatest common divisor (GCD). GCD is related to other problems: 1) Coprimality. Are two integers coprime? 2) Modular Inversion. If A and B are coprime, then find integers X and Y such that AX - BY = 1. 3) Extended GCD. Find integers X and Y such that AX - BY = GCD(A,B).

The Euclidean algorithm actually solves the extended GCD in sequential (?) time. However, the situation for parallel time complexity is far from satisfactory. In the first part of the talk we will survey what is currently known about the parallel time complexity of these problems. In the second part, we will sketch an NC algorithm for coprimality. The algorithm uses Neff's result on NC approximation of the zeros of a rational polynomial.

Biographical sketch

Dr. Bruce Litow is currently a senior lecturer in the computer science department of James Cook University in Townsville, Australia. Previously he was an assistant professor in the EECS department of the University of Wisconsin - Milwaukee. His principle interests are parallel and sequential complexity theory and formal language theory. Until January 31 he will be visiting the CS department of the University of Central Florida in Orlando.

Distributed Heterogenous Databases

Dr. Lawrence J. Henschen
Department of Electrical and Computer Engineering,
Northwestern University,
Illinois, USA.

Abstract

In addition to studying the details of how distributed systems themselves work, it is useful to be aware of various applications that run on distributed computing systems and the issues, both systems and application oriented, that must be addressed to make these applications successful.

We describe one such application, namely distributed heterogenous databases. The focus of the presentation will be on heterogeneity, the different types of heterogeneity and the approaches that have been proposed to address these various kinds of heterogeneity. The presentation does not address system-level aspects. We differentiate syntactic and semantic heterogeneity, describe some approaches taken by others to handle syntactic heterogeneity and then describe our own research into methods for solving semantic heterogeneity.

This is a joint work with C. Fernandes and T. Neild.

Biographical sketch

Dr. Henschen received the BA, MA and PhD degrees from the University of Illinois at Urbana-Champaign in 1966, 1968 and 1971 respectively. His main areas of research center on automated reasoning, particularly resolution and related inference mechanisms, and applications to database technology. His early work on Horn theories had significant application later to logic programming and deductive databases. He has participated in the development of many experimental reasoning systems. He was chairman of the first international meeting on automated theorem proving, which led to the CADE series of conferences. He and his students were the first to offer feasible solutions to the recursion and integrity problems in deductive databases. Dr. Henschen has produced almost 60 PhD students since 1971. He serves on the editorial board for two major journals and on the program committees of the major conferences for automated reasoning. He has published more than 100 journal and conference papers.

Evaluation of Compiler and run-time Library Approaches for Supporting Parallel Regular Applications

Dhruva R. Chakrabarti
Advisee of Dr. Prithviraj Banerjee,
Graduate student,
Northwestern University,
Illinois, USA.

Abstract

Important applications including those in computational chemistry, computational fluid dynamics, structural analysis and sparse matrix applications are frequently modeled using iterations on unstructured grids. This lack of structure makes it impossible for compilers to generate efficient code for multi-processors with distributed address space since the communication pattern can be determined only at run-time. This problem has been traditionally handled by using run-time libraries which generate optimized communication (equivalent to those generated by compilers for regular accesses) as soon as the communication pattern is known. However, in each of these applications, there is usually a combination of both regular and irregular accesses. While current state-of-the-art support for such applications handles the irregular accesses reasonably well, the efficacy of the optimizations at run-time for the regular accesses is yet to be proven. Our study aims to find out a better approach to handle the above applications in a unified compiler and run-time framework.

The seminar begins with a motivation behind optimization for irregularly structured code. The salient features of PILAR are presented. The talk then focuses upon the need to efficiently optimize the regular accesses within irregular applications. We evaluate the performance of two run-time libraries, namely PILAR and PILAR using an enumerated representation (equivalent to the CHAOS approach), and a commercial HPF compiler for purely regular applications. The aim is to determine the extent of overhead that is usually associated with run-time resolution of regular accesses when compared to a commercial compiler. We show that using a particular representation of regular accesses, the performance of regular code using run-time libraries can come close to the performance of code generated by a compiler. The results indicate that a run-time library incorporating multiple representations for efficient resolution of both regular and irregular accesses will perform better than conventional techniques. The results hint at possible amortization of some of the run-time overhead by moving some of the run-time complexities into a compiler which generates code annotated with calls to a run-time library for an application containing both regular and irregular accesses.

Tempest: A Substrate for Portable Parallel Programs

Dr. Mark D. Hill
Computer Sciences Department,
University of Wisconsin,
1210 West Dayton Street, Madison,
Wisconsin, USA.

Abstract

Are parallel programs portable? Today the answer is "no". Massively parallel processors (MPPs) are programmed with explicit message passing, networks of workstations (NOWs) with sockets, and symmetric multiprocessors (SMPs) with shared memory. To be portable, parallel applications must run across this range of machines without source-level changes. To make this possible, we have developed the Tempest interface, which provides mechanisms to support shared-memory, message-passing, and hybrid communications. With these mechanisms, a low-cost workstation cluster can provide the same communications abstractions (e.g., shared memory) as high-performance parallel machines.

The Tempest interface consists of low-level communication and memory-system mechanisms. User-level software implements specific policies, such as shared-memory protocols. Programmers and compilers can customize a policy to fit an application's semantics and sharing patterns. Experiments show that in some cases custom cache-coherence policies can produce an order-of-magnitude performance improvement, without restructuring a program's source code.

We have studied several Tempest implementations. The first, Typhoon, offers high performance with first-class hardware support. This proposed hardware implements Tempest using a fully-programmable, user-level processor in the network interface. The second system, Blizzard, demonstrates Tempest's portability by implementing the same interface on unmodified stock hardware: both a Thinking Machines CM-5 and the Wisconsin COW (a Cluster Of Workstations). Finally, our third system, Typhoon-Zero, serves as both a mid-range implementation and a prototype of the more complex Typhoon hardware. Typhoon-Zero adds a simple hardware module to the Wisconsin COW nodes to accelerate Tempest's fine-grain access control mechanism.

This work is part of the Wisconsin Wind Tunnel project, co-led by Mark Hill, James Larus, and David Wood. For more information, see URL http://www.cs.wisc.edu/~wwt.

Biographical sketch

Mark D. Hill (http://www.cs.wisc.edu/~markhill) is Associate Professor and Romnes Fellow in both the Computer Sciences Department and the Electrical and Computer Engineering Department at the University of Wisconsin-Madison. He earned a Ph.D. from Berkeley in 1987, won an NSF Presidential Young Investigator award in 1989, and is a Director of ACM SIGARCH. His work targets the memory systems of shared-memory multiprocessors and high-performance uniprocessors. He currently co-directs the Wisconsin Wind Tunnel parallel computing project with Profs. James Larus and David Wood.

I/O Issues in Parallel Object-Relational Databases

Sachin S. More
Advisee of Dr. Alok Choudhary,
Graduate student,
Northwestern University,
Illinois, USA.

Abstract

The object-relational database model provides the unique facility to add functionality to the database engine to tailor it for any application domain. The geo-spatial application domain is an interesting class of applications which require significant compute and storage resources. Recent research has shown that parallel object-relational databases can be used profitably to provide the necessary computing resources for geo-spatial applications. This talk focuses on the I/O related issues in parallel object-relational databases.

Geo-spatial datasets typically contain raster images, each of which is several megabytes in size. Queries process these images by applying operators and functions on them. Hence there is a need for efficient data storage and retrieval strategies. This talk will discuss some of these issues. Initial experimental results will also be presented.

Scalable, High-Performance Parallel Computing for Combinatorial Optimization Problems

Dr. Shantanu Dutt
Department of Electrical Engineering and Computer Science,
University of Illinois at Chicago,
Illinois, USA.

Abstract

In this talk, I will present new strategies for scalable high-performance parallel best-first search (BFS) algorithms for solving combinatorial optimization problems. BFS is a widely used strategy for solving many optimization problems. In parallel BFS, inefficiencies such as processor starvation and search of non-essential spaces (those not explored by the sequential algorithm) grow with the number of processors used (P), thus restricting its scalability. To alleviate this effect, we propose an efficient parallel startup phase and a novel dynamic load balancing method called the quality equalizing (QE) strategy. The QE strategy possesses certain unique quantitative and qualitative load balancing properties that allow it to significantly reduce starvation and non-essential work, and also ``anticipate'' these events before they occur. This yields a highly scalable parallel BFS algorithm with an almost-linear speedup. The startup and load balancing methods were employed in parallel BFS algorithms to solve Traveling Salesman Problems (TSPs) and Mixed Integer Programming (MIP) on the nCUBE2 and IBM SP-2 multicomputers. The QE strategy yields speedup improvements of about 20-185% and 15-120% over three well known load balancing methods---the round-robin (RR), the random communication (RC) and the neighborhood averaging (NA) strategies. The average speedup observed on the nCUBE2 with 1024 processors is about 985, representing a very high efficiency of 0.96.

We also analyze and empirically evaluate the scalability of parallel BFS algorithms in terms of the isoefficiency metric. Our analysis obtains the following tight bounds: (1) a \Theta(P log P) lower bound for any parallel BFS algorithm, and (2) a general expression for the upper bound on any topology---for the hypercube and 2-D mesh architectures the upper bounds on the isoefficiency function are found to be \Theta(P log^{2} P) and \Theta(P\sqrt{P}), respectively. Experimental results validate our analysis, and show that parallel BFS using the QE load balancing strategy has better scalability than other techniques.

In on-going work, we are investigating an adaptive load balancing framework QE* that can automatically adapt its communication frequency and locality of direct work transfers to application and parallel platform characteristics in order to obtain high performance across a wide range of applications and parallel machines (MPPs, SMP clusters and networks of workstations).

Biographical sketch

Shantanu Dutt received the B.E. degree in electronics and communication engineering from the M.S. University of Baroda, India, in 1983, the M.Tech. degree in computer engineering from the Indian Institute of Technology, Kharagpur, India in 1984, and the Ph.D. degree in computer science and engineering from the University of Michigan, Ann Arbor, in 1990. He is currently an associate professor at the Department of Electrical Engineering and Computer Science, University of Illinois at Chicago.

His current technical interests include CAD for VLSI/ULSI circuits, parallel and distributed computing, fault-tolerant computing and computer architecture.

Compiler Algorithms for Optimizing Locality and Parallelism on Shared Memory and Distributed Memory machines.

Mahmut T. Kandemir
Advisee of Dr. Alok Choudhary,
Graduate student,
Northwestern University,
Illinois, USA.

Abstract

Distributed-memory message-passing machines deliver scalable performance but are difficult to program. Shared-memory machines, on the other hand, are easier to program but obtaining a scalable performance with large number of processors is difficult. Recently, some scalable architectures based on logically-shared physically-distributed memory have been designed and implemented. While some of the performance issues like parallelism and locality are common to the different parallel architectures, issues such as data decomposition are unique to specific types of architectures. One of the most important challenges compiler writers face is to design compilation techniques that can work on a variety of architectures. In this talk, I propose an algorithm that can be employed by optimizing compilers for different types of parallel architectures. The optimization algorithm does the following: (1) transforms loop nests such that, where possible, the outermost loop can be run in parallel across processors; (2) decomposes each array across processors; (3) optimizes interprocessor communication by vectorizing it whenever possible; and (4) optimizes locality (cache performance) by assigning appropriate storage layout for each array. Depending on the underlying hardware system, some or all of these steps can be applied in a unified framework.

This is a joint work with Dr. Choudhary and Dr. Ramanujam.

The Illinois Aggressive COMA Multiprocessor Project (I-ACOMA)

Dr. Josep Torrellas
Department of Computer Science,
University of Illinois at Urbana-Champaign,
Illinois, USA.

Abstract

The goal of the I-ACOMA project is to explore how to design a highly programmable, high-performance scalable multiprocessor. In this talk, I will describe some of our most recent results and our research directions.

The progressive integration of processor and memory has unexpected implications for the design of DSM systems. To exploit this integration best, in the I-ACOMA multiprocessor, we have redesigned the nodes and reorganized the whole machine. The machine is composed of (future off-the-shelf) processor-in-memory chips. While these chips are of a single kind, some of them are used for computing purposes, while others are used for coherence protocol processing and to backup the application's data. However, the machine has the ability to reconfigure itself to adapt to the demands of the application. In addition, fast off-the-shelf processors-in-memory can be used effectively in an IRAM (intelligent memory) manner, especially for database applications. Finally, the machine has support for a novel hardware scheme for speculative parallelization.

If time permits, I will also describe some of the compiler work on data forwarding in I-ACOMA and database application support.

Biographical sketch

Dr. Josep Torrellas is an Assistant Professor at the Computer Science Department of the University of Illinois at Urbana-Champaign. He received a Ph.D. in Electrical Engineering from Stanford University in 1992. Professor Torrellas's primary interests are in the design of hardware and software for scalable shared-memory multiprocessors. He is currently leading the design of the Illinois Aggressive Cache Only Memory Architecture (I-ACOMA), a novel scalable shared-memory multiprocessor. Professor Torrellas received the NSF Research Initiation Award in 1993 and the NSF Young Investigator Award in 1994.

Parallel Algorithms for Power Estimation

Abstract

Death and Taxes, Nets and Caches: Facing Inevitabilities in Parallel PDE Simulation

Abstract

Biographical sketch

Parallel Complexity of Coprimality and related Arithmetic Problems

Abstract

Biographical sketch

Distributed Heterogenous Databases

Abstract

Biographical sketch

Evaluation of Compiler and run-time Library Approaches for Supporting Parallel Regular Applications

Abstract

Tempest: A Substrate for Portable Parallel Programs

Abstract

Biographical sketch

I/O Issues in Parallel Object-Relational Databases

Abstract

Scalable, High-Performance Parallel Computing for Combinatorial Optimization Problems

Abstract

Biographical sketch

Compiler Algorithms for Optimizing Locality and Parallelism on Shared Memory and Distributed Memory machines.

Abstract

The Illinois Aggressive COMA Multiprocessor Project (I-ACOMA)

Abstract

Biographical sketch

Death and Taxes, Nets and Caches:
Facing Inevitabilities in Parallel PDE Simulation