Spring Quarter 1997, Seminar Series


The speakers and topics of seminars that were held during spring are available here.
  • April 7 : Jian Chen (jchen@ece.nwu.edu)
  • April 14 : Gagan Hasteer (hasteer@crhc.uiuc.edu)
  • April 21 : Mahmut Kandemir (mtk@ece.nwu.edu)
  • April 28 : Jonathan Geisler (geisler@ece.nwu.edu)
  • May 5, 10 a.m. : Prof. Alok N. Choudhary (choudhar@ece.nwu.edu)
  • May 12 : Professor V. Taylor (taylor@ece.nwu.edu)
  • May 19 : Tom Cormen, Dartmouth College, Department of Computer Science
  • May 26 : Memorial Day (No seminar)
  • June 2 : Ian Foster, Mathematics & Computer Science Division, Argonne National Laboratory

Data Decomposition for Distributed Computing

Abstract

Distributed computing consists of a platform with a network of resources; these resources maybe personal computers, workstations, or parallel machines. Further, the resources maybe located at one site or distributed among different sites. Distributed systems provide an economical alternative to costly massively parallel computers. In this talk, we present a data decomposition tool, called PART, for distributed systems. PART takes into consideration the variance in performance of the networks and processors as well as the computational complexity and heterogeneity of the application. This is achieved via the parameters used in the objective function of simulated annealing. The initial version of PART focuses on finite element based problems. The results of using PART demonstrate a 20% reduction in execution time as compare to using conventional schemes that equally partition the problem domain.


Parallel Algorithms for State Assignment of Finite State Machines

Abstract

Simulated Annealing has been an effective tool in many optimization problems in VLSI CAD but its time requirements are prohibitive. In this presentation we will describe a parallel algorithm for a well established, simulated annealing based algorithm for the state assignment problem for finite state machines.

Our parallel annealing strategy uses parallel moves by multiple processes, each performing local moves within its assigned sub-space of the state encoding space. The novelty is in the dynamic repartitioning of the state space among processors, so that each processor gets to perform moves on the entire space over time. This is important to keep the quality of the parallel algorithm comparable to the serial algorithm. On the average our algorithm gives quality results within 0.05% of the serial algorithm on 64 processors.

Our algorithm is portable across a wide range of MIMD machines and gives superlinear speedups on all of them. For a large circuit, the runtime has been reduced from 11 hours to 10 minutes on a 64 processor machine.


Data Access Reorganizations in Compiling Out-of-Core Data Parallel Programs on Distributed Memory Machines

Abstract

The use of massively parallel machines to solve large scale computational problems in physics, chemistry, and other sciences has increased considerably in recent times. Many of these problems have computational requirements which stretch the capabilities of even the fastest supercomputer available today. In addition to requiring a great deal of computational power, these problems usually deal with large quantities of data up to a few terabytes. Main memories are not large enough to hold this much amount of data; so data needs to be stored on disks and fetched during the execution of the program. Unfortunately, the performance of the I/O subsystems of massively parallel computers has not kept pace with their processing and communication capabilities. Hence, the performance bottleneck is the time taken to perform disk I/O.

I will talk about optimization techniques for translating out-of-core programs written in a data parallel language to message passing node programs with explicit parallel I/O. I will describe how the compiler can optimize the code by (1) determining appropriate file layouts for out-of-core arrays, (2) permuting the loops in the nest(s) to allow efficient file access, and (3) partitioning the available node memory among references based on I/O cost estimation.


Abstract

In the future, we expect parallel machines to have deep memory hierarchies, with an intricate interconnection network between heterogeneous processors. It also seems likely that some processors may be associated with memory (e.g., PIM, IRAM, etc.).

We are serveying different models of parallel computation in an effort to develop a model as a system analysis tool. Such a model will allow an algorithm designer to choose the system that best matches the developed solution, or for system designers to analyze the performance of a new design before it is built.

We present one such model named the Processor Memory Hierarchy (PMH) and also discuss how well its goals match the designs of future machines. It is based on a fat tree network of communication and is the only model we have found that includes both the memory hierarchy and the interconnection network.


Design and Implementation of a High-Performance Media-on-Demand Server

Abstract

High performance servers and high speed networks will form the backbone of the infrastructure required for distributed multimedia information systems. A server for an interactive distributed multimedia system may require thousands of gigabytes of storage space and high I/O bandwidth. An architectural model of a server for such a system is developed. Parallelism of data retrieval is achieved by striping the data across multiple disks. The admission control policy for accepting a new request at steady state is presented. The performance of any server ultimately depends on the data access patterns. Modifications of the basic retrieval algorithm that exploit data access patterns in order to improve system throughput are presented. The results of a detailed, component-wise instrumentation of an implementation of the server model are presented.

In order to maximize system utilization, and thus minimize cost, it is essential that the load be balanced among each of the server's components viz. the disks, the interconnection network and the scheduler. Dynamic allocation policies for improving server capacity are presented. These policies assign media requests to the nodes of the server so as to balance the load on the interconnection network and the scheduling nodes. Five policies for request assignment, Round Robin (RR), Minimum Link Allocation (MLA), Minimum Contention Allocation (MCA), Weighted Minimum Link Allocation (WMLA) and Weighted Minimum Contention Allocation (WMCA) are developed. We also address the issues of file replication for achieving load balancing among the disks in the storage subsystem, and the effect of varying the concurrency of data retrieval from the storage subsystem. The performance of the dynamic allocation policies on an implementation of the server model is evaluated.


The HPAM Architecture for Petaflops Applications

Abstract

The hierarchical processors-and-memory (HPAM) architecture uses commercial-off-the-shelf microprocessors to provide a cost-efficient machine for petaflop applications. This heterogeneous multiprocessor system is organized as a hierarchy of processors-and-memory subsystems. Across the different levels, the processor speeds and interconnection technology vary. The top level of HPAM consists of a small number of expensive, fast processors; the lower levels consist of a large number of inexpensive, slower processors.

Significant work has been done to demonstrate the advantages of the HPAM architecture for various applications. This work includes in-depth application studies along with the development of an efficient simulator, HPAM_Sim. HPAM_Sim was used to analyze the cost-efficiency advantages of mapping the CMU benchmarks onto HPAM versus using an optimal homogeneous machine. The study quantifies the gains that can be achieved with using a two or three level HPAM machine for mid-range to high-range budgets. Further, in depth analysis of an implicit, finite element code, used to analyze industrial problems, has demonstrated significant advantages, in terms of effectiveness, of using the HPAM architecture for which the different levels can be used to match the different degrees of parallelism that occurs throughout execution. Similar trends were found with another industrial application, a seismic application, which had several degrees of parallelism and communication. The seismic study also confirmed that there is locality with respect to the degree of parallelism; hence, the communication across the different HPAM levels is low. In this presentation we will present the details of the various application studies and summarize the results.

This is joint work with Prof. Jose' Fortes, Prof. Rudolf Eigenman, Zina Ben-Miled and Brian Armstrong at Purdue University and Meena Kandaswamy and Shai Eisen at Northestern University.


Performing Out-of-Core FFTs on Parallel Disk Systems

Tom Cormen

Dartmouth College, Department of Computer Science

Abstract

The Fast Fourier Transform (FFT) plays a key role in many areas of computational science and engineering. Although in most cases, 1-dimensional FFT problem sizes are small enough (at most 100,000 points or so) to allow them to be computed in memory, there is another class of problems for which the problem size can be so large that it exceeds the memory capacity of even large supercomputers.

In this talk, I will discuss how to perform 1-dimensional FFTs on both uniprocessors and distributed-memory multiprocessors when the data reside on a parallel disk system with independent disk accesses.

For uniprocessors, I will present both analytical and experimental results for performing out-of-core FFTs in two ways: using traditional virtual memory with demand paging, and using an explicit out-of-core algorithm developed for the Parallel Disk Model of Vitter and Shriver. When using virtual memory with demand paging, FFT performance is extremely poor once the problem size exceeds the memory size. In particular, almost every data access during the initial bit-reversal permutation induces a page fault. Experimental results indicate that the explicit out-of-core algorithm not only outperforms the in-core methods running with demand paging (sometimes quite dramatically), but that it can also compete with the in-core methods even when they run in-core.

Joint work with Jake Wegmann and David Nicol


Designing and Building Parallel Climate Models

Ian Foster

Argonne National Laboratory

University of Chicago

Abstract

The impossibility of meaningful experiments makes numerical modeling a particularly important tool in studies of the earth's climate system. Driven by the need for greater physical resolution, physical realism, and timescales, climate models have historically been among the first applications to exploit new computer architectures. In recent years, a number of groups have addressed the problem of transitioning climate models from vector multiprocessors to scalable parallel computers. In this talk, I describe one such effort, a joint project of Argonne and other laboratories and universities. This project has produced new numerical methods and parallel algorithms suitable for use on parallel computers, and incorporated these new techniques in models capable of executing with high efficiency on multiple parallel platforms. My goal is to describe not only the numerical and algorithmic techniques required to build parallel climate models, but also the software engineering issues that arise when dealing with systems of this complexity. In many respects, the latter problems are harder than the former.