Spring Quarter 1997, Seminar Series
The speakers and topics of seminars that were held during spring
are available here.
- April 7 : Jian Chen (jchen@ece.nwu.edu)
- April 14 : Gagan Hasteer (hasteer@crhc.uiuc.edu)
- April 21 : Mahmut Kandemir (mtk@ece.nwu.edu)
- April 28 : Jonathan Geisler (geisler@ece.nwu.edu)
- May 5, 10 a.m. : Prof. Alok N. Choudhary (choudhar@ece.nwu.edu)
- May 12 : Professor V. Taylor (taylor@ece.nwu.edu)
- May 19 : Tom Cormen, Dartmouth College, Department of Computer Science
- May 26 : Memorial Day (No seminar)
- June 2 : Ian Foster, Mathematics & Computer Science Division, Argonne National Laboratory
|
Data Decomposition for Distributed Computing
Abstract
Distributed computing consists of a platform with a network of resources;
these resources maybe personal computers, workstations, or parallel machines.
Further, the resources maybe located at one site or distributed among
different sites. Distributed systems provide an economical alternative
to costly massively parallel computers. In this talk, we present a
data decomposition tool, called PART, for distributed systems. PART takes
into consideration the variance in performance of the networks and
processors as well as the computational complexity and heterogeneity
of the application. This is achieved via the parameters used in the
objective function of simulated annealing. The initial version of
PART focuses on finite element based problems. The results
of using PART demonstrate a 20% reduction in execution
time as compare to using conventional schemes that equally
partition the problem domain.
Parallel Algorithms for State Assignment of Finite State Machines
Abstract
Simulated Annealing has been an effective tool in many optimization
problems in VLSI CAD but its time requirements are prohibitive.
In this presentation we will describe a parallel algorithm for a well
established, simulated annealing based algorithm
for the state assignment problem for finite state machines.
Our parallel annealing strategy uses parallel moves by multiple
processes, each performing local moves within its assigned sub-space of
the state encoding space. The novelty is in the dynamic repartitioning of
the state space among processors, so that each processor gets to perform
moves on the entire space over time. This is important to keep the quality
of the parallel algorithm comparable to the serial algorithm. On the average
our algorithm gives quality results within 0.05% of the serial algorithm
on 64 processors.
Our algorithm is portable across a wide range of MIMD machines and
gives superlinear speedups on all of them. For a large circuit, the runtime
has been reduced from 11 hours to 10 minutes on a 64 processor machine.
Data Access Reorganizations in Compiling Out-of-Core Data
Parallel Programs on Distributed Memory Machines
Abstract
The use of massively parallel machines to solve large scale
computational problems in physics, chemistry, and other
sciences has increased considerably in recent times. Many
of these problems have computational requirements which stretch the
capabilities of even the fastest supercomputer available today.
In addition to requiring a great deal of computational power, these
problems usually deal with large quantities of data up to a few terabytes.
Main memories are not large enough to hold this much
amount of data; so data needs to be stored on disks and fetched during
the execution of the program. Unfortunately, the performance of the
I/O subsystems of massively parallel computers has not kept pace with
their processing and communication capabilities. Hence, the
performance bottleneck is the time taken to perform disk I/O.
I will talk about optimization techniques for translating out-of-core programs
written in a data parallel language to message passing node programs with
explicit parallel I/O. I will describe how the compiler can optimize
the code by (1) determining appropriate file layouts for out-of-core
arrays, (2) permuting the loops in the nest(s) to allow efficient file
access, and (3) partitioning the available node memory among references
based on I/O cost estimation.
Abstract
In the future, we expect parallel machines to have deep memory
hierarchies, with an intricate interconnection network between
heterogeneous processors. It also seems likely that some processors may
be associated with memory (e.g., PIM, IRAM, etc.).
We are serveying different models of parallel computation in an effort
to develop a model as a system analysis tool. Such a model will allow
an algorithm designer to choose the system that best matches the
developed solution, or for system designers to analyze the performance
of a new design before it is built.
We present one such model named the Processor Memory Hierarchy (PMH) and
also discuss how well its goals match the designs of future machines.
It is based on a fat tree network of communication and is the only model
we have found that includes both the memory hierarchy and the
interconnection network.
Design and Implementation of a High-Performance Media-on-Demand Server
Abstract
High performance servers and high speed networks will form the
backbone of the infrastructure required for distributed multimedia
information systems. A server for an interactive distributed
multimedia system may require thousands of gigabytes of storage space
and high I/O bandwidth. An architectural model of a server for such a
system is developed. Parallelism of data retrieval is achieved by
striping the data across multiple disks. The admission control policy
for accepting a new request at steady state is presented. The
performance of any server ultimately depends on the data access
patterns. Modifications of the basic retrieval algorithm that exploit
data access patterns in order to improve system throughput are
presented. The results of a detailed, component-wise instrumentation
of an implementation of the server model are presented.
In order to maximize system utilization, and thus minimize cost, it is
essential that the load be balanced among each of the server's
components viz. the disks, the interconnection network and the
scheduler. Dynamic allocation policies for improving server capacity
are presented. These policies assign media requests to the nodes of
the server so as to balance the load on the interconnection network
and the scheduling nodes. Five policies for request assignment, Round
Robin (RR), Minimum Link Allocation (MLA), Minimum Contention
Allocation (MCA), Weighted Minimum Link Allocation (WMLA) and Weighted
Minimum Contention Allocation (WMCA) are developed. We also address
the issues of file replication for achieving load balancing among the
disks in the storage subsystem, and the effect of varying the
concurrency of data retrieval from the storage subsystem. The
performance of the dynamic allocation policies on an implementation of
the server model is evaluated.
The HPAM Architecture for Petaflops Applications
Abstract
The hierarchical processors-and-memory (HPAM) architecture
uses commercial-off-the-shelf microprocessors to provide a
cost-efficient machine for petaflop applications. This
heterogeneous multiprocessor system is organized as a
hierarchy of processors-and-memory subsystems. Across the
different levels, the processor speeds and interconnection
technology vary. The top level of HPAM consists of a small
number of expensive, fast processors; the lower levels consist
of a large number of inexpensive, slower processors.
Significant work has been done to demonstrate the advantages of
the HPAM architecture for various applications. This work includes
in-depth application studies along with the development of an efficient
simulator, HPAM_Sim. HPAM_Sim was used to analyze the cost-efficiency
advantages of mapping the CMU benchmarks onto HPAM versus using an
optimal homogeneous machine. The study quantifies the gains that can
be achieved with using a two or three level HPAM machine for mid-range
to high-range budgets. Further, in depth analysis of an implicit, finite
element code, used to analyze industrial problems, has demonstrated
significant advantages, in terms of effectiveness, of using the HPAM
architecture for which the different levels can be used to match the
different degrees of parallelism that occurs throughout execution.
Similar trends were found with another industrial application, a seismic
application, which had several degrees of parallelism and communication.
The seismic study also confirmed that there is locality with respect
to the degree of parallelism; hence, the communication across
the different HPAM levels is low.
In this presentation we will present the details of the various
application studies and summarize the results.
This is joint work with Prof. Jose' Fortes, Prof. Rudolf Eigenman,
Zina Ben-Miled and Brian Armstrong at Purdue University and
Meena Kandaswamy and Shai Eisen at Northestern University.
Performing Out-of-Core FFTs on Parallel Disk Systems
Tom Cormen
Dartmouth College, Department of Computer Science
Abstract
The Fast Fourier Transform (FFT) plays a key role in many areas of
computational science and engineering. Although in most cases,
1-dimensional FFT problem sizes are small enough (at most 100,000
points or so) to allow them to be computed in memory, there is another
class of problems for which the problem size can be so large that it
exceeds the memory capacity of even large supercomputers.
In this talk, I will discuss how to perform 1-dimensional FFTs on both
uniprocessors and distributed-memory multiprocessors when the data
reside on a parallel disk system with independent disk accesses.
For uniprocessors, I will present both analytical and experimental
results for performing out-of-core FFTs in two ways: using traditional
virtual memory with demand paging, and using an explicit out-of-core
algorithm developed for the Parallel Disk Model of Vitter and Shriver.
When using virtual memory with demand paging, FFT performance is
extremely poor once the problem size exceeds the memory size. In
particular, almost every data access during the initial bit-reversal
permutation induces a page fault. Experimental results indicate that
the explicit out-of-core algorithm not only outperforms the in-core
methods running with demand paging (sometimes quite dramatically), but
that it can also compete with the in-core methods even when they run
in-core.
Joint work with Jake Wegmann and David Nicol
Designing and Building Parallel Climate Models
Ian Foster
Argonne National Laboratory
University of Chicago
Abstract
The impossibility of meaningful experiments makes numerical modeling a
particularly important tool in studies of the earth's climate system.
Driven by the need for greater physical resolution, physical realism,
and timescales, climate models have historically been among the first
applications to exploit new computer architectures. In recent years,
a number of groups have addressed the problem of transitioning climate
models from vector multiprocessors to scalable parallel computers. In
this talk, I describe one such effort, a joint project of Argonne and
other laboratories and universities. This project has produced new numerical
methods and parallel algorithms suitable for use on parallel computers,
and incorporated these new techniques in models capable of executing with
high efficiency on multiple parallel platforms. My goal is to describe
not only the numerical and algorithmic techniques required to build
parallel climate models, but also the software engineering issues that
arise when dealing with systems of this complexity. In many respects,
the latter problems are harder than the former.