%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ieeecomp95.ps
%
"The PARADIGM Compiler for Distributed-Memory Multicomputers"
P. Banerjee, J. A. Chandy, M. Gupta, E. W. Hodges IV, J. G. Holm, A. Lain,
D. J. Palermo, S. Ramaswamy, and E. Su.
in IEEE Computer, Vol. 28, No. 10, pages 37-47, October 1995.

A flexible compiler framework for distributed-memory multicomputers
automatically parallelizes sequential programs. A unified approach
efficiently supports regular and irregular computations using data and
functional parallelism.

Massively parallel distributed-memory multicomputers can achieve the high
performance levels required to solve the Grand Challenge computational
science problems (a class of computational applications, identified by the
1992 US Presidential Initiative in High-Performance Computing and
Communications, that would require a significant increase in computing
power). Multicomputers such as the Intel Paragon, the IBM SP-1/SP-2
(Scalable PowerParallel 1 and 2), and the Thinking Machines CM-5
(Connection Machine 5) offer significant cost and scalability advantages
over shared-memory multiprocessors. However, to harness these machines'
computational power, users must write efficient software. This process is
laborious because of the absence of a global address space. The programmer
must manually distribute computations and data across processors and
explicitly manage communication. The Paradigm (Parallelizing Compiler for
Distributed-Memory, General-Purpose Multicomputers) project at the
University of Illinois addresses this problem by developing automatic
methods for efficient parallelization of sequential programs.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% iwpp94.ps
%
"The PARADIGM Compiler for Distributed-Memory Message Passing
Multicomputers"
P. Banerjee, J. A. Chandy, M. Gupta, J. G. Holm, A. Lain, D. J. Palermo,
S. Ramaswamy, and E. Su.
in Proceedings of the First International Workshop on Parallel Processing,
pages 322-330, Bangalore, India, December 1994.

The PARADIGM compiler project provides an automated means to parallelize
programs, written in a serial programming model, for efficient execution on
distributed-memory multicomputers. In addition to performing traditional
compiler optimizations, PARADIGM is unique in that it addresses many other
issues within a unified platform: automatic data distribution, synthesis of
high-level communication, communication optimizations, irregular
computations, functional and data parallelism, and multithreaded execution.
This paper describes the techniques used and provides experimental evidence
of their effectiveness.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% fdc94.bbrc.ps
%
"Compiler Assisted Synthesis of Algorithm-Based Checking in Multiprocessors"
P. Banerjee, V. Balasubramanian, and A. Roy Chowdhury
in "Foundations of Dependable Computing: Vol III, System Implementation",
Gary Koob editor, pages 159-211, Kluwer Academic Publishers, 1994.

In this section we describe a compile-time approach to synthesizing
algorithm-based checks for numerical programs. The compiler is used to
identify linear transformations within loops and to restructure nonlinear
program statements to introduce more linearity into the program. The data
manipulated by these linear statements are then checked through the
introduction of checksums, which can often be done more cheaply than
replication. We discuss the implementation of a source-to-source
restructuring compiler based on the above approach, and present results of
applying this compiler to routines from LINPACK, EISPACK and the Perfect
Benchmark Suite.
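
A minimal sketch of the checksum idea, in Python (an illustration under
assumed names and sizes, not code from the chapter): for a linear statement
y = A x, the identity e^T (A x) = (e^T A) x lets one precomputed dot product
check the result instead of duplicating the whole computation.

    import numpy as np

    # Illustrative only: check y = A @ x with a checksum, not duplication.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))
    x = rng.standard_normal(4)

    y = A @ x                      # the computation being checked
    check_row = np.ones(4) @ A     # e^T A, precomputed once per loop nest
    assert abs(check_row @ x - y.sum()) < 1e-9   # e^T (A x) == (e^T A) x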

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% icpp90.bb.ps
%
"Approximate Algorithms for the Partitionable Independent Task Scheduling
Problem"
K. P. Belkhale and P. Banerjee
in Proceedings of the 19th International Conference on Parallel Processing,
pages 72-75, St. Charles, IL, August 1990.

The problem of scheduling a collection of tasks on a multiprocessor
consisting of p processors so as to minimize the maximum completion time
has attracted a lot of attention in the literature. In this paper, we
introduce a new problem of scheduling a collection of independent tasks on
a multiprocessor, called the partitionable independent task scheduling
problem. Associated with each task, we are given the time it takes to run
on a uniprocessor and the speedup that can be obtained by running it on i
processors, 1 <= i <= p. We present an approximate algorithm that
guarantees a solution within 2/(1 + 1/p) of the optimal solution, under a
reasonable assumption on the speedup functions.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ipps91.bb.ps
%
"A Scheduling Algorithm for Parallelizable Dependent Tasks"
K. P. Belkhale and P. Banerjee
in Proceedings of the International Parallel Processing Symposium, pages
500-506, Anaheim, CA, April 1991.

The problem of scheduling a collection of tasks on a multiprocessor
consisting of p processors so as to minimize the maximum completion time
has attracted a lot of attention in the literature. In this paper, we
introduce a new problem of scheduling a task graph on a multiprocessor,
called the parallelizable dependent task scheduling problem. Associated
with each task, we are given the time it takes to run on a uniprocessor and
the speedup that can be obtained by running it on i processors,
1 <= i <= p. We present an algorithm for the problem and analyze its
performance.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% dmcc91.gb.ps
%
"Automatic Data Partitioning on Distributed Memory Multiprocessors"
Manish Gupta and Prithviraj Banerjee
Sixth Distributed Memory Computing Conference, Portland, OR, April 1991.

An important problem facing numerous research projects on parallelizing
compilers for distributed memory machines is that of automatically
determining a suitable data partitioning scheme for a program. Most of the
current projects leave this tedious problem almost entirely to the user. In
this paper, we present a novel approach to the problem of automatic data
partitioning. We introduce the notion of constraints on data distribution,
and show how, based on performance considerations, a compiler identifies
constraints to be imposed on the distribution of various data structures.
These constraints are then combined by the compiler to obtain a complete
and consistent picture of the data distribution scheme, one that offers
good performance in terms of the overall execution time.
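
A hypothetical sketch of the constraint-combination step in Python (the
tuple format, the weights, and the winner-takes-all resolution rule are
assumptions for illustration, not the paper's actual representation): each
reference pattern proposes a constraint on an array's distribution together
with a quality measure, and conflicting proposals are resolved in favour of
the larger combined weight.

    from collections import defaultdict

    def combine(constraints):
        """constraints: (array, proposed distribution, quality measure)."""
        weight = defaultdict(float)
        for array, proposal, quality in constraints:
            weight[(array, proposal)] += quality   # merge agreeing constraints
        scheme = {}
        for (array, proposal), w in weight.items():
            if array not in scheme or w > scheme[array][1]:
                scheme[array] = (proposal, w)      # heavier proposal wins
        return {a: p for a, (p, _) in scheme.items()}

    # Two conflicting proposals for A; the better-supported one is kept.
    print(combine([("A", "BLOCK on dim 1", 2.5),
                   ("A", "BLOCK on dim 2", 0.4),
                   ("A", "BLOCK on dim 1", 1.2)]))  # {'A': 'BLOCK on dim 1'}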

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% crhc9116.gb.ps
%
"Compile-Time Estimation of Communication Costs in Multicomputers"
Manish Gupta and Prithviraj Banerjee
Technical Report CRHC-91-16/UILU-ENG-91-2226, Center for Reliable and
High-Performance Computing, University of Illinois, Urbana, IL, May 1991
(revised, August 1992).

An important problem facing numerous research projects on parallelizing
compilers for distributed memory machines is that of automatically
determining a suitable data partitioning scheme for a program. Any strategy
for automatic data partitioning needs a mechanism for estimating the
performance of a program under a given partitioning scheme, the most
crucial part of which involves determining the communication costs incurred
by the program. In this paper, we describe a methodology for statically
estimating the communication times as certain functions of the array sizes
and the numbers of processors over which various arrays are distributed. We
present an algorithm to detect when communication may be taken out of a
loop, and also identify places in the program where transformations such as
loop distribution and loop interchange should be performed to enable
combining of messages. For certain loops with constant dependences, the
compiler can detect the possibility of pipelining, and take the overlapping
into account to get more accurate cost estimates. These results are of
great significance to any parallelization system supporting numeric
applications on multicomputers. In particular, they lay down a framework
for effective synthesis of communication on multicomputers from sequential
program references.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ipps92.gb.ps
%
"Compile-Time Estimation of Communication Costs on Multicomputers"
Manish Gupta and Prithviraj Banerjee
in Proceedings of the 6th International Parallel Processing Symposium,
pages 470-475, Beverly Hills, CA, March 1992.

An important problem facing numerous research projects on parallelizing
compilers for distributed memory machines is that of automatically
determining a suitable data partitioning scheme for a program. Any strategy
for automatic data partitioning needs a mechanism for estimating the
performance of a program under a given partitioning scheme, the most
crucial part of which involves determining the communication costs incurred
by the program. In this paper, we describe a methodology for statically
estimating the communication times as certain functions of the array sizes
and the numbers of processors over which various arrays are distributed.
This work also lays down a framework for effective synthesis of
communication on multicomputers from sequential program references.
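
A small Python illustration of what such static estimates look like (the
machine parameters ts and tw and the cost formulas are assumptions chosen
for this sketch, not figures from the paper): each communication pattern is
costed as a function of the message size and the number of processors.

    import math

    def transfer(m, ts=100e-6, tw=0.5e-6):
        """Time for one message of m elements: start-up plus per-element."""
        return ts + tw * m

    def boundary_exchange(n, p):
        """n x n array, rows block-distributed over p processors: every
        processor swaps one boundary row of n elements with a neighbour."""
        return transfer(n)

    def broadcast(m, p):
        """Tree-structured one-to-all broadcast: ceil(log2 p) message steps."""
        return math.ceil(math.log2(p)) * transfer(m)

    # Estimated per-sweep costs for a 512 x 512 stencil on 16 processors:
    print(boundary_exchange(512, 16), broadcast(512, 16))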

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% tpds92.gb.ps
%
"Demonstration of Automatic Data Partitioning Techniques for Parallelizing
Compilers on Multicomputers"
Manish Gupta and Prithviraj Banerjee
IEEE Transactions on Parallel and Distributed Systems, 3(2):179-193, March
1992.

An important problem facing numerous research projects on parallelizing
compilers for distributed memory machines is that of automatically
determining a suitable data partitioning scheme for a program. Most of the
current projects leave this tedious problem almost entirely to the user. In
this paper, we present a novel approach to the problem of automatic data
partitioning. We introduce the notion of constraints on data distribution,
and show how, based on performance considerations, a compiler identifies
constraints to be imposed on the distribution of various data structures.
These constraints are then combined by the compiler to obtain a complete
and consistent picture of the data distribution scheme, one that offers
good performance in terms of the overall execution time. We present results
of a study we performed on Fortran programs taken from the Linpack and
Eispack libraries and the Perfect Benchmarks to determine the applicability
of our approach to real programs. The results are very encouraging, and
demonstrate the feasibility of automatic data partitioning for programs
with regular computations that may be statically analyzed, which covers an
extremely significant class of scientific application programs.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ics92.gb.ps
%
"A Methodology for High-Level Synthesis of Communication on Multicomputers"
Manish Gupta and Prithviraj Banerjee
in Proceedings of the Sixth ACM International Conference on Supercomputing,
pages 357-367, Washington D.C., July 1992.

Freeing the user from the tedious task of generating explicit communication
is one of the primary goals of numerous research projects on compilers for
distributed memory machines. In the process of synthesis of communication,
the effective use of collective communication routines offers a
considerable scope for improving the program performance. This paper
presents a methodology for determining the collective communication
primitives that should be used for implementing the data movement at
various points in the program. We introduce the notion of certain
synchronous properties between array references in statements inside loops,
and present tests to determine the presence of these properties. These
tests enable the compiler to analyze quite precisely the communication
requirements of those statements, and implement communication using
appropriate primitives. These results not only lay down a framework for
synthesis of communication on multicomputers, they also form the basis of
our implementation of a system that statically estimates the communication
costs of programs on multicomputers.
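
A much-simplified Python sketch of how a compiler can pick a collective
primitive from subscript patterns (the two-coefficient subscript encoding
and the primitive names are assumptions for illustration; the paper's
synchronous-property tests are considerably more general):

    def classify(lhs, rhs):
        """Subscripts encoded as (a, b) for a*i + b in a loop over i;
        lhs is the written reference, rhs the read reference."""
        (al, bl), (ar, br) = lhs, rhs
        if (al, bl) == (ar, br):
            return "local access, no communication"
        if al == ar != 0:
            return f"Shift by {br - bl}"         # e.g. A(i) = B(i-1)
        if ar == 0:
            return f"Broadcast of element {br}"  # e.g. A(i) = B(3)
        return "general Gather"

    print(classify((1, 0), (1, -1)))  # Shift by -1
    print(classify((1, 0), (0, 3)))   # Broadcast of element 3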

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% phd92.g.ps
%
"Automatic Data Partitioning on Distributed Memory Multicomputers"
Manish Gupta
Ph.D. Dissertation, University of Illinois at Urbana-Champaign
PhD thesis, Department of Computer Science, University of Illinois, Urbana,
IL, September 1992, CRHC-92-19/UILU-ENG-92-2237.
Center for Reliable and High-Performance Computing
Advisor: Prithviraj Banerjee

Distributed-memory parallel computers are increasingly being used to
provide high levels of performance for scientific applications.
Unfortunately, such machines are not very easy to program. A number of
research efforts seek to alleviate this problem by developing compilers
that take over the task of generating communication. The communication
overheads and the extent of parallelism exploited in the resulting target
program are determined largely by the manner in which data is partitioned
across different processors of the machine. Most of the compilers provide
no assistance to the programmer in the crucial task of determining a good
data partitioning scheme. This thesis presents a novel approach, the
constraint-based approach, to the problem of automatic data partitioning
for numeric programs. In this approach, the compiler identifies some
desirable requirements on the distribution of various arrays being
referenced in each statement, based on performance considerations. These
desirable requirements are referred to as constraints. For each constraint,
the compiler determines a quality measure that captures its importance with
respect to the performance of the program. The quality measure is obtained
through static performance estimation, without actually generating the
target data-parallel program with explicit communication. Each data
distribution decision is taken by combining all the relevant constraints.
The compiler attempts to resolve any conflicts between constraints such
that the overall execution time of the parallel program is minimized. This
approach has been implemented as part of a compiler called PARADIGM, which
accepts Fortran 77 programs and specifies the partitioning scheme to be
used for each array in the program. We have obtained results on some
programs taken from the Linpack and Eispack libraries, and the Perfect
Benchmarks. These results are quite promising, and demonstrate the
feasibility of automatic data partitioning for a significant class of
scientific application programs with regular computations.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ics93.gb.ps
%
"PARADIGM: A Compiler for Automatic Data Distribution on Multicomputers"
Manish Gupta and Prithviraj Banerjee
in Proceedings of the 7th ACM International Conference on Supercomputing,
Tokyo, Japan, July 1993.

One of the most challenging steps in developing a parallel program for a
distributed memory machine is determining how data should be distributed
across processors. Most of the compilers being developed to make it easier
to program such machines still provide no assistance to the programmer in
this difficult and machine-dependent task. We have developed PARADIGM, a
compiler that makes data partitioning decisions for Fortran 77 procedures.
A significant feature of the design of PARADIGM is the decomposition of the
data partitioning problem into a number of sub-problems, each dealing with
a different distribution parameter for all the arrays. This paper presents
the algorithms that, in conjunction with the computational and
communication cost estimators we have developed, determine those
distribution parameters. We also present results obtained on Fortran
procedures taken from the Linpack and Eispack libraries, and the Perfect
Benchmarks. We believe these are the first results demonstrating the
success of automatic data partitioning on a significant class of Fortran
procedures.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% jopl94.gb.ps
%
"Compile-Time Estimation of Communication Costs of Programs"
Manish Gupta and Prithviraj Banerjee
to appear in the Journal of Programming Languages, 1994.

One of the most challenging problems in compiling for distributed memory
machines is to determine how data for a program should be distributed
across processors. Any compiler that makes data partitioning decisions
needs a mechanism for estimating the communication and computational costs
of programs to compare different alternatives. This paper presents a
methodology for estimating the communication costs of programs written in a
global address space. In this approach, the compiler analyzes programs
before generating communication, and yet takes into account important
communication optimizations that will be performed. We introduce the notion
of traversal properties of array references in loops, which help identify
the nature and extent of data movement in terms of high-level communication
primitives. This enables the compiler to obtain more precise information
about the global state of communication, in a largely machine-independent
manner.
The methodology described in this paper has been implemented in a compiler,
PARADIGM, that automatically determines data partitioning for Fortran
programs. The results obtained with PARADIGM confirm the importance of this
analysis for making good data partitioning decisions. The techniques
developed in this work for recognizing communication primitives that best
characterize the data movement are quite general, and also form the basis
of communication generation in PTRAN-II, a prototype compiler for High
Performance Fortran.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ms95.h.ps
%
"High Performance Fortran Support for the PARADIGM Compiler"
Eugene W. Hodges IV
Master's thesis, Department of Electrical Engineering, University of
Illinois, Urbana, IL, October 1995.
Center for Reliable and High-Performance Computing
Advisor: Prithviraj Banerjee

Lacking a global address space, distributed-memory multicomputers present a
difficult programming model. High Performance Fortran (HPF) provides a
portable, efficient environment for the programmer writing data-parallel
applications. Research in the PARADIGM compiler project seeks to alleviate
the programmer's burden of writing explicitly parallel applications using
non-portable libraries and language extensions by automatically compiling
sequential Fortran programs into efficient SPMD message-passing programs.
Because the technology to automatically partition data and detect all
parallelism in a sequential program is not mature, HPF provides a standard
interface for supplying additional information to the compiler. The
information expressed in an HPF program is thus complementary to the
PARADIGM compiler project: HPF allows the programmer to assist the compiler
with parts of the compilation process that may eventually be automated, as
well as to provide information that is impossible to determine at compile
time or efficiently at run time. This thesis presents an overview of HPF,
the implementation of HPF in the PARADIGM compiler, and an evaluation of
the compiler using several benchmarks. The results demonstrate the
effectiveness of the compiler in generating efficient SPMD Fortran code
from the HPF code.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% shpcc94.hlb.ps
%
"Compilation of Scientific Programs into Multithreaded and Message Driven
Computation"
John Holm, Antonio Lain, and Prithviraj Banerjee
in Proceedings of the 1994 Scalable High Performance Computing Conference,
pages 518-525, Knoxville, TN, May 1994.

Many programs written in the SPMD programming model send messages
asynchronously, and block when receiving messages. Multiple threads can
make use of the processor while other threads wait for messages. This paper
describes and evaluates two techniques for multithreading on the nodes of
distributed memory message passing systems. One method is a purely runtime
threads package. The second method requires the SPMD code to be
systematically transformed into message-driven code that can be run under a
message-driven model. The multithreading of scientific applications is
evaluated on the iPSC/2 and the CM-5.
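
A toy Python sketch of the latency-hiding idea (threads and queues standing
in for processor nodes and messages are assumptions of this sketch; the
systems in the paper are a runtime threads package and a message-driven
code transformation):

    import threading, queue, time

    inbox = queue.Queue()

    def waiting_thread():
        msg = inbox.get()          # blocking receive: this thread sleeps
        print("received:", msg)

    def network():
        time.sleep(0.1)            # simulated message latency
        inbox.put("boundary data")

    t = threading.Thread(target=waiting_thread)
    n = threading.Thread(target=network)
    t.start(); n.start()
    print("main thread keeps computing while the receive is pending")
    t.join(); n.join()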

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ics94.lb.ps
%
"Techniques to Overlap Computation and Communication in Irregular Iterative
Applications"
A. Lain and P. Banerjee
in Proceedings of the 8th ACM International Conference on Supercomputing,
pages 236-245, Manchester, England, July 1994.

There are many applications in CFD and structural analysis that can be more
accurately modeled using unstructured grids. Parallelization of implicit
methods for unstructured grids is a difficult and important problem. This
paper deals with coloring techniques to overlap computation and
communication during the solution of implicit methods on message-passing
distributed-memory multicomputers. An evaluation of coloring techniques for
partitioned unstructured grids is first presented. Results show the
importance of using partitioning information during coloring. It is next
shown that overlapping computation and communication can be formalized as a
generalized coloring problem. Modified coloring algorithms are used for
this purpose. The PARTI library has been extended to support non-blocking
gather-scatter operations and used in conjunction with these algorithms.
Practicality issues are evaluated with experimental results on an Intel
Paragon multicomputer.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ipps95.lb.ps
%
"Exploiting Spatial Regularity in Irregular Iterative Applications"
A. Lain and P. Banerjee
in Proceedings of the 9th International Parallel Processing Symposium,
pages 820-827, Santa Barbara, CA, April 1995.

The increasing gap between the speed of microprocessors and memory
subsystems makes it imperative to exploit locality of reference in
sequential irregular applications. The parallelization of such applications
requires special consideration. Current run-time support (RTS) for
irregular computations fails to exploit the fine-grain regularity present
in these applications, incurring unnecessary time and memory overheads.
PILAR (Parallel Irregular Library with Application of Regularity) is a new
RTS for irregular computations that provides a variety of internal
representations of communication patterns based on their regularity,
allowing for the efficient support of a wide spectrum of regularity under a
common framework. Experimental results on the IBM SP-1 and the Intel
Paragon demonstrate the validity of our approach.
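
A minimal Python sketch of the kind of regularity PILAR can exploit (the
interval encoding below illustrates the general idea and is not PILAR's
actual internal representation): indirection arrays in irregular codes
often contain long consecutive runs, which compress to a few
(start, length) intervals, cutting both preprocessing time and memory.

    def to_intervals(indices):
        """Collapse an indirection array into (start, length) runs."""
        runs, start, length = [], indices[0], 1
        for a, b in zip(indices, indices[1:]):
            if b == a + 1:
                length += 1                   # extend the current run
            else:
                runs.append((start, length))  # close it and start a new one
                start, length = b, 1
        runs.append((start, length))
        return runs

    # A mostly-regular access pattern collapses to three intervals:
    print(to_intervals([4, 5, 6, 7, 20, 21, 22, 9]))
    # -> [(4, 4), (20, 3), (9, 1)]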

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% phd95.l.ps
%
"Compiler and Run-Time Support for Irregular Computations"
A. Lain
PhD thesis, Department of Computer Science, University of Illinois, Urbana,
IL, October 1995, CRHC-92-22/UILU-ENG-95-2236.
Center for Reliable and High-Performance Computing
Advisor: Prithviraj Banerjee

There are many important applications in computational fluid dynamics,
circuit simulation and structural analysis that can be more accurately
modeled using iterations on unstructured grids. In these problems, regular
compiler analysis for Massively Parallel Processors (MPP) with distributed
address space fails because communication can only be determined at
run-time. However, in many of these applications the communication pattern
repeats for every iteration. Therefore, optimizations equivalent to those
in the regular case can be achieved with a combination of run-time support
(RTS) and compiler analysis. Moreover, many real applications have
irregular access patterns with certain structure. This structure is needed
to improve sequential performance by better exploiting the memory
hierarchy. In many cases this structure is reflected in the array subscript
expressions in the form of hybrid accesses, that is, affine combinations of
array indirection and regular accesses. We will show that by taking
advantage of this structure, we can dramatically reduce pre-processing
overheads, both in terms of memory and time, and improve communication and
single-node performance. In order to achieve these goals we need a
combination of run-time and compiler support. Unfortunately, existing
irregular run-time support cannot take advantage of regularity; therefore,
our first requirement was to develop a new run-time support for irregular
computations, which we call PILAR (Parallel Irregular Library with
Application of Regularity). Our second requirement was to automate the task
of exploiting regularity in irregular applications through compiler
support. For this task we extended the PARADIGM compiler, developed at the
University of Illinois, to handle irregular problems. Experimental results
validate the efficacy of our approach.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% icpp94.pscb.ps
%
"Communication Optimizations used in the PARADIGM Compiler for
Distributed-Memory Multicomputers"
Daniel J. Palermo, Ernesto Su, John A. Chandy, and Prithviraj Banerjee
in Proceedings of the 23rd International Conference on Parallel Processing,
pages II:1-10, St. Charles, IL, August 1994.

The PARADIGM (PARAllelizing compiler for DIstributed-memory General-purpose
Multicomputers) project at the University of Illinois provides a fully
automated means to parallelize programs, written in a serial programming
model, for execution on distributed-memory multicomputers. To provide
efficient execution, PARADIGM automatically performs various optimizations
to reduce the overhead and idle time caused by interprocessor
communication. Optimizations studied in this paper include message
coalescing, message vectorization, message aggregation, and coarse-grain
pipelining. To separate the optimization algorithms from machine-specific
details, parameterized models are used to estimate communication and
computation costs for a given machine. The models are also used in
coarse-grain pipelining to automatically select a task granularity that
balances the available parallelism with the costs of communication. To
determine the applicability of the optimizations on different machines, we
analyzed their performance on an Intel iPSC/860, an Intel iPSC/2, and a
Thinking Machines CM-5.
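
A back-of-the-envelope Python sketch of why message vectorization pays off
(ts and tw are assumed machine parameters, not measurements from the
paper): hoisting communication out of a loop replaces n one-element
messages with a single n-element message, so the start-up cost is paid once
instead of n times.

    def per_iteration_cost(n, ts=100e-6, tw=0.5e-6):
        return n * (ts + tw)        # n messages of one element each

    def vectorized_cost(n, ts=100e-6, tw=0.5e-6):
        return ts + tw * n          # one message of n elements

    n = 1000
    print(per_iteration_cost(n) / vectorized_cost(n))  # roughly 167x cheaper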

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% crhc9509.pb.ps
%
"Automatic Selection of Dynamic Data Partitioning Schemes for
Distributed-Memory Multicomputers"
Daniel J. Palermo and Prithviraj Banerjee
Technical Report CRHC-95-09/UILU-ENG-95-2210, Center for Reliable and
High-Performance Computing, University of Illinois, Urbana, IL, April 1995.

For distributed memory multicomputers such as the Intel Paragon, the IBM
SP-2, the NCUBE/2, and the Thinking Machines CM-5, the quality of the data
partitioning for a given application is crucial to obtaining high
performance. This task has traditionally been the user's responsibility,
but in recent years much effort has been directed to automating the
selection of data partitioning schemes. Several researchers have proposed
systems that are able to produce data distributions that remain in effect
for the entire execution of an application. For complex programs, however,
such static data distributions may be insufficient to obtain acceptable
performance. The selection of distributions that dynamically change over
the course of a program's execution adds another dimension to the data
partitioning problem. In this paper, we present a technique that can be
used to automatically determine which partitionings are most beneficial
over specific sections of a program while taking into account the added
overhead of performing redistribution. This system is being built as part
of the PARADIGM (PARAllelizing compiler for DIstributed memory
General-purpose Multicomputers) project at the University of Illinois. The
complete system will provide a fully automated means to parallelize
programs written in a serial programming model obtaining high performance
on a wide range of distributed-memory multicomputers.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% lcpc95.pb.ps
%
"Automatic Selection of Dynamic Data Partitioning Schemes for
Distributed-Memory Multicomputers"
Daniel J. Palermo and Prithviraj Banerjee
in the Proceedings of the 8th Workshop on Languages and Compilers for
Parallel Computing, pages 392-406, Columbus, OH, August 1995.

For distributed-memory multicomputers such as the Intel Paragon, the IBM
SP-1/SP-2, the NCUBE/2, and the Thinking Machines CM-5, the quality of the
data partitioning for a given application is crucial to obtaining high
performance. This task has traditionally been the user's responsibility,
but in recent years much effort has been directed to automating the
selection of data partitioning schemes. Several researchers have proposed
systems that are able to produce data distributions that remain in effect
for the entire execution of an application. For complex programs, however,
such static data distributions may be insufficient to obtain acceptable
performance. The selection of distributions that dynamically change over
the course of a program's execution adds another dimension to the data
partitioning problem. In this paper, we present a technique that can be
used to automatically determine which partitionings are most beneficial
over specific sections of a program while taking into account the added
overhead of performing redistribution. This system is being built as part
of the PARADIGM (PARAllelizing compiler for DIstributed memory
General-purpose Multicomputers) project at the University of Illinois. The
complete system will provide a fully automated means to parallelize
programs written in a serial programming model obtaining high performance
on a wide range of distributed-memory multicomputers.
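
A hypothetical Python sketch of the static-versus-dynamic decision (the
single lumped redistribution cost and the example numbers are assumptions
for illustration): a dynamic scheme wins only when its per-phase savings
exceed the redistribution overhead.

    def best_scheme(phase_costs, redistribution_cost):
        """phase_costs: {scheme name: [estimated cost of each phase]}."""
        best_static = min(sum(c) for c in phase_costs.values())
        n_phases = len(next(iter(phase_costs.values())))
        dynamic = sum(min(c[i] for c in phase_costs.values())
                      for i in range(n_phases)) + redistribution_cost
        return (("dynamic", dynamic) if dynamic < best_static
                else ("static", best_static))

    # Phase 1 favours a row distribution, phase 2 a column distribution:
    print(best_scheme({"rows": [1.0, 9.0], "cols": [9.0, 1.0]}, 2.0))
    # -> ('dynamic', 4.0): 1.0 + 1.0 + 2.0 beats the best static cost 10.0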

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% icpp96.pshb.ps
%
"Compiler Support for Privatization on Distributed-Memory Machines"
D. J. Palermo, E. Su, E. W. Hodges IV, and P. Banerjee
to appear in the Proceedings of the 25th International Conference on
Parallel Processing, Bloomingdale, IL, August 1996.

The practice of using temporary scalar or array variables to store the
results of common subexpressions presents several challenges to a
parallelizing compiler. Not only do dependence analysis and, as a result,
parallelization suffer, but existing techniques used for partitioning
programs and generating communication for parallel execution on
distributed-memory multicomputers also tend to break down. Techniques that
have been developed over the years to compensate for this programming
practice include scalar expansion, global forward substitution, and
privatization, each of which has its own strengths and weaknesses. Compared
to scalar expansion and global forward substitution, privatization has the
advantage of not causing an increase in memory requirements or operation
counts, but when compiling for distributed-memory machines it causes
several new problems to arise. In this paper, we present a simple extension
to a uniform array-region analysis framework that utilizes privatization
information to partition loops and generate efficient communication, using
the owner-computes rule, in the presence of temporary variables.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% phd96.p.ps
%
"Compiler Techniques for Optimizing Communication and Data Distribution for
Distributed-Memory Multicomputers"
Daniel J. Palermo
PhD thesis, Department of Electrical and Computer Engineering, University
of Illinois at Urbana-Champaign, Urbana, IL, June 1996,
CRHC-96-09/UILU-ENG-96-2215.
Center for Reliable and High-Performance Computing
Advisor: Prithviraj Banerjee

Distributed-memory multicomputers, such as the Intel iPSC/860, the Intel
Paragon, the IBM SP-1/SP-2, the NCUBE/2, and the Thinking Machines CM-5,
offer significant advantages over shared-memory multiprocessors in terms of
cost and scalability. However, lacking a global address space, they present
a very difficult programming model in which the user must specify how data
and computation are to be distributed across processors and determine which
sections of data must be communicated to specific processors. To overcome
this difficulty, significant research effort has been aimed at
source-to-source parallelizing compilers for multicomputers that relieve
the programmer of the tasks of program partitioning and communication
generation, while the specification of data distributions has remained a
responsibility of the programmer. The quality of the data distribution for
a given application is crucial to obtaining high performance on
distributed-memory multicomputers. By selecting an appropriate data
distribution, the amount of communication required by an application can be
dramatically reduced. The resulting performance using a given data
distribution therefore depends on how well the compiler can optimize the
remaining communication. In this thesis, we present and analyze several
techniques to optimize communication and, based on the performance of these
optimizations, automatically select the best data partitioning for a given
application. Previous work in the area of optimizing data distribution used
constraints based on performance estimates (which model the communication
optimizations) to select high-quality data distributions that remain in
effect for the entire execution of an application. For complex programs,
however, such static data distributions may be insufficient to obtain
acceptable performance. The selection of distributions that dynamically
change over the course of a program's execution (taking into account the
added overhead of performing redistribution) adds another dimension to the
data partitioning problem. In this thesis, we present a technique that
extends the static data partitioning algorithm to automatically determine
those distributions most beneficial over specific sections of a program
while taking into account the added overhead of performing redistribution.
Finally, we also present an interprocedural data-flow framework that
performs the conversion between a distribution-based representation (as
specified by the data partitioner) and a redistribution-based form (as
specified by HPF) while optimizing the amount of redistribution that must
be performed. These techniques have been implemented as part of the
PARADIGM (PARAllelizing compiler for DIstributed memory General-purpose
Multicomputers) project at the University of Illinois. The complete system
strives to provide a fully automated means to parallelize sequential
programs obtaining high performance on a wide range of distributed-memory
multicomputers.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% icpp93.rb.ps
%
"Processor Allocation and Scheduling of Macro Dataflow Graphs on
Distributed Memory Multicomputers by the PARADIGM Compiler"
Shankar Ramaswamy and Prithviraj Banerjee
in Proceedings of the 22nd International Conference on Parallel Processing,
pages II:134-138, St. Charles, IL, August 1993.

Functional or control parallelism is an effective way to increase speedups
on multicomputers. Programs for these machines are represented by Macro
Dataflow Graphs (MDGs) for the purpose of functional parallelism analysis
and exploitation. We discuss algorithms for allocation and scheduling of
MDGs, along with some analysis of their optimality. These algorithms
attempt to minimize the execution time of any given MDG through
exploitation of functional parallelism. Our preliminary results show their
effectiveness over naive algorithms.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% crhc9409.rb.ps
%
"Automatic Generation of Efficient Array Redistribution Routines for
Distributed Memory Multicomputers"
Shankar Ramaswamy and Prithviraj Banerjee
Technical Report CRHC-94-09/UILU-ENG-94-2213, Center for Reliable and
High-Performance Computing, University of Illinois, Urbana, IL, June 1994.

Appropriate data distribution has been found to be critical for obtaining
good performance on Distributed Memory Multicomputers like the CM-5, Intel
Paragon and IBM SP-1. It has also been found that some programs need to
change their distributions during execution for better performance
(redistribution). This work focuses on automatically generating efficient
routines for redistribution. We present a new mathematical representation
for regular distributions called PITFALLS and then discuss algorithms for
redistribution based on this representation. One significant contribution
of this work is the ability to handle arbitrary source and target processor
sets while performing redistribution. Another important contribution is the
ability to handle an arbitrary number of dimensions for the array involved
in the redistribution in a scalable manner. Our implementation of these
techniques is based on an MPI-like communication library. The results
presented show the low overheads of our redistribution algorithm as
compared to naive runtime methods.
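
An illustrative Python sketch of the redistribution problem that the
PITFALLS representation solves symbolically (the element-by-element set
enumeration below is deliberately naive, which is exactly what the paper's
line-segment algebra avoids): intersecting source and target ownership sets
yields what each processor pair must exchange.

    def owned(n, p, nprocs, block):
        """Global indices of array[0:n] owned by processor p under a
        block-cyclic distribution with the given block size."""
        return {i for i in range(n) if (i // block) % nprocs == p}

    n, P = 16, 2
    # Redistribute from BLOCK (block size n // P) to CYCLIC (block size 1):
    for src in range(P):
        for dst in range(P):
            moves = owned(n, src, P, n // P) & owned(n, dst, P, 1)
            print(f"P{src} -> P{dst}: {sorted(moves)}")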

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% crhc9410.rsb.ps
%
"A Framework for Exploiting Data and Functional Parallelism on Distributed
Memory Multicomputers"
Shankar Ramaswamy, Sachin Sapatnekar, and Prithviraj Banerjee
Technical Report CRHC-94-10/UILU-ENG-94-2224, Center for Reliable and
High-Performance Computing, University of Illinois, Urbana, IL, June 1994.

Recent research efforts have shown the benefits of integrating functional
and data parallelism over using either pure data parallelism or pure
functional parallelism. This paper presents a theoretical framework for
deciding on a good execution strategy for a given program based on the
available functional and data parallelism in the program. The framework is
based on assumptions about the form of the computation and communication
cost functions for multicomputer systems. We present mathematical functions
for these costs and show that these functions are realistic. The framework
also requires specification of the available functional and data
parallelism for a given problem. For this purpose, we have developed a
graphical programming tool. We have tested our approach using three
benchmark programs on the Thinking Machines CM-5 and the Intel Paragon. The
results presented show that the approach is very effective and can provide
a two- to three-fold increase in speedups over approaches using only data
parallelism.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% icpp94.rsb.ps
%
"A Convex Programming Approach for Exploiting Data and Functional
Parallelism on Distributed Memory Multicomputers"
Shankar Ramaswamy, Sachin Sapatnekar, and Prithviraj Banerjee
in Proceedings of the 23rd International Conference on Parallel Processing,
pages II:116-125, St. Charles, IL, August 1994.

In the past, compilers have focused on the exploitation of either
functional or data parallelism. The PARADIGM compiler project at the
University of Illinois is among the first to incorporate techniques for the
simultaneous exploitation of both. This paper describes the techniques used
in the PARADIGM compiler and analyzes their optimality. It is the first of
its kind to use realistic cost models and to include data transfer costs,
which all previous researchers have neglected. Preliminary results on the
CM-5 show the efficacy of our methods and the significant advantages of
using functional and data parallelism together for the execution of real
applications.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% fmpc95.rb.ps
%
"Automatic Generation of Efficient Array Redistribution Routines for
Distributed Memory Multicomputers"
Shankar Ramaswamy and Prithviraj Banerjee
in Proceedings of Frontiers '95: The Fifth Symposium on the Frontiers of
Massively Parallel Computation, pages 342-349, McLean, VA, February 1995.

Appropriate data distribution has been found to be critical for obtaining
good performance on Distributed Memory Multicomputers like the CM-5, Intel
Paragon and IBM SP-1. It has also been found that some programs need to
change their distributions during execution for better performance
(redistribution). This work focuses on automatically generating efficient
routines for redistribution. We present a new mathematical representation
for regular distributions called PITFALLS and then discuss algorithms for
redistribution based on this representation. A significant contribution of
this work is the ability to handle arbitrary source and target processor
sets while performing redistribution; another is the ability to handle
arbitrary dimensionality for the array being redistributed in a scalable
manner. The results presented show low overheads for our redistribution
algorithm as compared to naive runtime methods.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ppl95.rb.ps
%
"Simultaneous Allocation and Scheduling Using Convex Programming Techniques"
S. Ramaswamy and P. Banerjee
in Parallel Processing Letters, December 1995.

Simultaneous exploitation of task and data parallelism provides significant
benefits for many applications. The basic approach for exploiting task and
data parallelism is to use a task graph representation (Macro Dataflow
Graph) for programs to decide on the degree of data parallelism to be used
for each task (allocation) and an execution order for the tasks
(scheduling). Previously, we presented a two-step approach for allocation
and scheduling by considering the two steps to be independent of each
other. In this paper, we present a new simultaneous approach which uses
constraints to model the scheduler during allocation. The new simultaneous
approach provides significant benefits over our earlier approach for the
benchmark task graphs that we have considered.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% phd96.r.ps
%
"Simultaneous Exploitation of Task and Data Parallelism in Regular
Scientific Applications"
Shankar Ramaswamy
PhD thesis, Department of Electrical and Computer Engineering, University
of Illinois at Urbana-Champaign, Urbana, IL, January 1996.
Center for Reliable and High-Performance Computing
Advisor: Prithviraj Banerjee

Distributed Memory Multicomputers (DMMs) such as the IBM SP-2, the Intel
Paragon and the Thinking Machines CM-5 offer significant advantages over
shared memory multiprocessors in terms of cost and scalability.
Unfortunately, the utilization of all the available computational power in
these machines involves a tremendous programming effort on the part of
users, which creates a need for sophisticated compiler and run-time support
for distributed memory machines. In this thesis we explore a new compiler
optimization for regular scientific applications - the simultaneous
exploitation of task and data parallelism. Our optimization is implemented
as part of the PARADIGM HPF compiler framework and as part of a MATLAB
compiler framework we have developed. The intuitive idea behind the
optimization is the use of task parallelism to control the degree of data
parallelism of individual tasks. The reason this provides increased
performance is that data parallelism provides diminishing returns as the
number of processors used is increased. By controlling the number of
processors used for each data parallel task in an application and by
concurrently executing these tasks, we make program execution more
efficient and therefore faster. A practical implementation of a task and
data parallel scheme of execution for an application on a distributed
memory multicomputer also involves data redistribution. This data
redistribution causes an overhead. However, as our experimental results
show, this overhead is not a problem; execution of a program using task and
data parallelism together can be significantly faster than its execution
using data parallelism alone. This makes our proposed optimization
practical and extremely useful.
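
An illustrative Python sketch of the diminishing-returns argument (the cost
model t = w/p + overhead*p and all constants are assumptions chosen for
this sketch, not the thesis's measured models):

    def t(work, p, overhead=0.05):
        """Data-parallel task time: computation shrinks with p,
        communication grows with p."""
        return work / p + overhead * p

    P = 16
    tasks = [10.0, 10.0]                # two equal, independent tasks
    data_only = sum(t(w, P) for w in tasks)           # in sequence on all P
    task_and_data = max(t(w, P // 2) for w in tasks)  # side by side, P/2 each
    print(data_only, task_and_data)     # 2.85 vs 1.65: concurrent wins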

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ipps96.rhb.ps
%
"Compiling Matlab Programs to ScaLAPACK: Exploiting Task and Data
Parallelism"
Shankar Ramaswamy, Eugene W. Hodges IV, and Prithviraj Banerjee
in Proceedings of the International Parallel Processing Symposium,
Honolulu, HI, April 1996.

In this paper we suggest a new approach aimed at reducing the effort
required to program distributed-memory multicomputers. The key idea in our
approach is to automatically convert a program written in the library-based
programming language Matlab into a parallel program based on the ScaLAPACK
parallel library. In the process of performing this conversion, we apply
compiler optimizations that simultaneously exploit task and data
parallelism. As our results show, our approach is feasible and practical
and our optimization provides significant performance benefits.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% crhc9520.rb.ps
%
"Compiler Assisted Generation of Error Detecting Parallel Programs"
A. Roy-Chowdhury and P. Banerjee
Technical Report CRHC-95-20/UILU-ENG-95-2231, Center for Reliable and
High-Performance Computing, University of Illinois, Urbana, IL, August
1995.

We have developed an automated, compile-time approach to generating
error-detecting parallel programs. The compiler is used to identify
statements implementing affine transformations within the program and to
automatically insert code for computing, manipulating, and comparing
checksums in order to check the correctness of the code implementing affine
transformations. Statements which do not implement affine transformations
are checked by duplication. Checksums are reused from one loop to the next
where possible, rather than recomputing checksums for every statement. A
global dataflow analysis is performed in order to determine points at which
checksums need to be recomputed. We also use a novel method of specifying
the data distributions of the check data using directives provided by the
High Performance Fortran (HPF) standard so that the computations on the
original data and the corresponding check computations are performed on
different processors. Results are presented on an Intel Paragon distributed
memory multicomputer.

Keywords: Algorithm-Based Fault Tolerance, Checksum Encoding, Parallelizing
Compilers, Compiler Assisted Fault Tolerance.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% phd95.r.ps
%
"Manual and Compiler Assisted Methods for Generating Fault-Tolerant
Parallel Programs"
A. Roy-Chowdhury
PhD thesis, Department of Electrical and Computer Engineering, University
of Illinois at Urbana-Champaign, Urbana, IL, December 1995,
CRHC-95-27/UILU-ENG-95-2243.
Center for Reliable and High-Performance Computing
Advisor: Prithviraj Banerjee

Algorithm-based fault-tolerance (ABFT) is an inexpensive method of
incorporating fault-tolerance into existing applications. Applications are
modified to operate on encoded data and produce encoded results which may
then be checked for correctness. An attractive feature of the scheme is
that it requires little or no modification to the underlying hardware or
system software. Previous algorithm-based methods for developing reliable
versions of numerical programs for general-purpose multicomputers have
mostly concerned themselves with error detection. A truly fault-tolerant
algorithm, however, needs to locate errors and recover from them once they
have been located. In a parallel processing environment, this corresponds
to locating the faulty processors and recovering the data corrupted by the
faulty processors. In this dissertation, we first present a general scheme
for performing fault-location and recovery under the ABFT framework.
Our fault model assumes that a faulty processor can corrupt all of the data
it possesses. The fault-location scheme is an application of system-level
diagnosis theory to the ABFT framework, while the fault-recovery scheme
uses ideas from coding theory to maintain redundant data and uses it to
recover corrupted data in the event of processor failures. Results are
presented on implementations of three numerical algorithms on a distributed
memory multicomputer, which demonstrate acceptably low overheads for the
single- and double-fault location and recovery cases. For a class of
algorithms performing affine transformations, we automate the process of
generating an error-detecting version at compile time. The compiler is used
to identify loops that perform affine transformations on array elements.
These loops are then checked by computing a checksum over the array
elements being transformed and transforming the checksums appropriately,
which typically results in much smaller overheads than checking the entire
code by duplication. Portions of code in the program that are not affine
transformations are checked by duplication. An existing source-to-source
compiler, Parafrase-2, has been modified to take in programs written in
High Performance Fortran (HPF) and output an error-detecting version of the
same. Data distributions for the new arrays and checksums introduced are
specified by inserting additional HPF directives in the program. The
modified program can then be input to a parallelizer for distributed memory
machines, such as PARADIGM, to obtain an error-detecting parallel program.
We demonstrate results on three numerical programs by executing the
error-detecting versions generated by our compiler on a distributed memory
multicomputer.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ms93.s.ps
%
"Automating Parallelization of Regular Computations for Distributed-Memory
Multicomputers"
E. Su
Master's thesis, Department of Electrical and Computer Engineering,
University of Illinois at Urbana-Champaign, Urbana, IL, March 1993.
Center for Reliable and High-Performance Computing
Advisor: Prithviraj Banerjee

Distributed-memory multicomputers, such as the Intel iPSC/860, the NCUBE/2,
the Intel Paragon and the Connection Machine CM-5, offer significant
advantages over shared-memory multiprocessors in terms of cost and
scalability. Unfortunately, to extract all of that computational power from
these machines, users have to write efficient software for them, which is
an extremely laborious process. One major reason for this difficulty is the
absence of a single global shared address space. As a result, the
programmer has to distribute code and data across processors and manage
communication among tasks explicitly. Clearly there is a need for efficient
parallel programming support on these machines. The PARADIGM parallelizing
compiler project addresses that problem by developing an automated means of
compiling programs written in a serial programming model for efficient
execution on distributed-memory multicomputers. In this thesis we discuss
automatic parallelization of regular computations using symbolic sets as a
representation for data accesses and communication. Algorithms for
computation partitioning and communication generation are discussed.
Results on the speedup and efficiency of some SPMD node programs generated
by PARADIGM are studied to determine the applicability of our approach.
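
A minimal Python sketch of the owner-computes partitioning that such set
representations support (PARADIGM manipulates these sets symbolically; the
concrete numbers, the BLOCK-only distribution, and the function names here
are illustrative assumptions):

    def local_iterations(lo, hi, p, nprocs, n):
        """Iterations i in [lo, hi) whose left-hand side A[i] is owned by
        processor p, for A[0:n] distributed BLOCK over nprocs processors."""
        block = -(-n // nprocs)                       # ceil(n / nprocs)
        own_lo, own_hi = p * block, min((p + 1) * block, n)
        return range(max(lo, own_lo), min(hi, own_hi))

    # Loop 'for i in 1..98: A[i] = ...' with A[0:100] on 4 processors:
    for p in range(4):
        print(p, local_iterations(1, 99, p, 4, 100))
    # -> range(1, 25), range(25, 50), range(50, 75), range(75, 99)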

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% icpp93.spb.ps
%
"Automating Parallelization of Regular Computations for Distributed-Memory
Multicomputers"
Ernesto Su, Daniel J. Palermo, and Prithviraj Banerjee
in Proceedings of the 1993 International Conference on Parallel Processing,
pages II:30-38, St. Charles, IL, August 1993.

Distributed-memory multicomputers such as the Intel iPSC/860, the NCUBE/2,
the Intel Paragon and the Connection Machine CM-5 offer significant
advantages over shared-memory multiprocessors in terms of cost and
scalability. Unfortunately, to extract all the computational power from
these machines, users have to parallelize their existing serial programs,
which can be an extremely laborious process. One major reason for this
difficulty is the absence of a single global shared address space. As a
result, the programmer has to distribute code and data on processors and
manage communication among tasks explicitly. Clearly there is a need for
efficient parallelizing compiler support on these machines. The PARADIGM
project at the University of Illinois addresses these problems by
developing a fully automated means to translate serial programs for
efficient execution on distributed-memory multicomputers. In this paper we
discuss parallelization of regular computations using symbolic sets as a
representation for data accesses and communication.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% pact94.spb.ps
%
"Processor Tagged Descriptors: A Data Structure for Compiling for
Distributed-Memory Multicomputers"
Ernesto Su, Daniel J. Palermo, and Prithviraj Banerjee
in Proceedings of the 1994 International Conference on Parallel
Architectures and Compilation Techniques, pages 123-132, Montréal, Canada,
August 1994.

The computation partitioning, communication analysis, and optimization
phases performed during compilation for distributed-memory multicomputers
require an efficient way of describing distributed sets of iterations and
regions of data. Processor Tagged Descriptors (PTDs) provide these
capabilities through a single set representation parameterized by the
processor location for each dimension of a virtual mesh. A uniform
representation is maintained for every processor in the mesh, whether it is
a boundary or an interior node. As a result, operations on the sets are
very efficient because the effect on all processors in a dimension can be
captured in a single symbolic operation. In addition, PTDs are easily
extended to an arbitrary number of dimensions, necessary for describing
iteration sets in multiply nested loops as well as sections of
multidimensional arrays. Using the symbolic features of PTDs it is also
possible to generate code for variable numbers of processors, thereby
allowing a compiled program to run unchanged on varying sized machines. The
PARADIGM (PARAllelizing compiler for DIstributed-memory General-purpose
Multicomputers) project at the University of Illinois utilizes PTDs to
provide an automated means to parallelize serial programs for execution on
distributed-memory multicomputers.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ics95.slrphb.ps
%
"Advanced Compilation Techniques in the PARADIGM Compiler for
Distributed-Memory Multicomputers"
Ernesto Su, Antonio Lain, Shankar Ramaswamy, Daniel J. Palermo,
Eugene W. Hodges IV, and Prithviraj Banerjee
in Proceedings of the 9th ACM International Conference on Supercomputing,
pages 424-433, Barcelona, Spain, July 1995.

The PARADIGM compiler project provides an automated means to parallelize
programs, written in a serial programming model, for efficient execution on
distributed-memory multicomputers. A previous implementation of the
compiler, based on the PTD representation, allowed symbolic array sizes,
affine loop bounds and array subscripts, and a variable number of
processors, provided that arrays were single- or multi-dimensionally block
distributed. The techniques presented here extend the compiler to also
accept multi-dimensional cyclic and block-cyclic distributions within a
uniform symbolic framework. These extensions demand more sophisticated
symbolic manipulation capabilities. A novel aspect of our approach is to
meet this demand by interfacing PARADIGM with a powerful off-the-shelf
symbolic package, Mathematica(tm). This paper describes some of the
Mathematica(tm) routines that perform various transformations, shows how
they are invoked and used by the compiler to overcome the new challenges,
and presents experimental results for code involving cyclic and
block-cyclic arrays as evidence of the feasibility of the approach.
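
A small Python illustration of the index arithmetic that cyclic and
block-cyclic distributions introduce (the closed forms below are standard
block-cyclic mappings written out concretely; the compiler derives and
manipulates such expressions symbolically):

    def to_local(i, block, nprocs):
        """Owner and local position of global element i under CYCLIC(block)."""
        owner = (i // block) % nprocs
        local = (i // (block * nprocs)) * block + i % block
        return owner, local

    # CYCLIC(2) over 3 processors:
    for i in range(8):
        print(i, to_local(i, 2, 3))
    # e.g. global element 7 lives on processor 0 at local position 3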