%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ieeecomp95.ps
%
"The PARADIGM Compiler for Distributed-Memory Multicomputers"
P. Banerjee, J. A. Chandy, M. Gupta, E. W. Hodges IV, J. G. Holm, A. Lain,
D. J. Palermo, S. Ramaswamy, and E. Su.
in IEEE Computer, Vol. 28, No. 10, pages 37-47, October 1995.

A flexible compiler framework for distributed-memory multicomputers
automatically parallelizes sequential programs. A unified approach
efficiently supports regular and irregular computations using data and
functional parallelism.

Massively parallel distributed-memory multicomputers can achieve the high
performance levels required to solve the Grand Challenge computational
science problems (a class of computational applications, identified by the
1992 US Presidential Initiative in High-Performance Computing and
Communications, that would require a significant increase in computing
power). Multicomputers such as the Intel Paragon, the IBM SP-1/SP-2
(Scalable PowerParallel 1 and 2), and the Thinking Machines CM-5
(Connection Machine 5) offer significant cost and scalability advantages
over shared-memory multiprocessors. However, to harness these machines'
computational power, users must write efficient software. This process is
laborious because of the absence of a global address space. The programmer
must manually distribute computations and data across processors and
explicitly manage communication. The Paradigm (Parallelizing Compiler for
Distributed-Memory, General-Purpose Multicomputers) project at the
University of Illinois addresses this problem by developing automatic
methods for efficient parallelization of sequential programs.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% iwpp94.ps
%
"The PARADIGM Compiler for Distributed-Memory Message Passing
Multicomputers"
P. Banerjee, J. A. Chandy, M. Gupta, J. G. Holm, A. Lain, D. J. Palermo,
S. Ramaswamy, and E. Su.
in Proceedings of the First International Workshop on Parallel Processing,
pages 322-330, Bangalore, India, December 1994.

The PARADIGM compiler project provides an automated means to parallelize
programs, written in a serial programming model, for efficient execution on
distributed-memory multicomputers. In addition to performing traditional
compiler optimizations, PARADIGM is unique in that it addresses many other
issues within a unified platform: automatic data distribution, synthesis of
high-level communication, communication optimizations, irregular
computations, functional and data parallelism, and multithreaded execution.
This paper describes the techniques used and provides experimental evidence
of their effectiveness.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% fdc94.bbrc.ps
%
"Compiler Assisted Synthesis of Algorithm-Based Checking in Multiprocessors"
P. Banerjee, V. Balasubramanian, and A. Roy Chowdhury
in "Foundations of Dependable Computing: Vol III, System Implementation",
Gary Koob editor, pages 159-211, Kluwer Academic Publishers, 1994.

In this section we describe a compile-time approach to synthesizing
algorithm-based checks for numerical programs. The compiler is used to
identify linear transformations within loops and to restructure nonlinear
program statements to introduce more linearity into the program. The data
manipulated by these linear statements are then checked through the
introduction of checksums, which can often be done more cheaply than
replication. We discuss the implementation of a source-to-source
restructuring compiler based on the above approach, and present results of
applying this compiler to routines from LINPACK, EISPACK and the Perfect
Benchmark Suite.
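
A minimal sketch of the checksum idea, in Python (an illustration under
assumed names and sizes, not code from the chapter): for a linear statement
y = A x, the identity e^T (A x) = (e^T A) x lets one precomputed dot product
check the result instead of duplicating the whole computation.

    import numpy as np

    # Illustrative only: check y = A @ x with a checksum, not duplication.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))
    x = rng.standard_normal(4)

    y = A @ x                      # the computation being checked
    check_row = np.ones(4) @ A     # e^T A, precomputed once per loop nest
    assert abs(check_row @ x - y.sum()) < 1e-9   # e^T (A x) == (e^T A) x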

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% icpp90.bb.ps
%
"Approximate Algorithms for the Partitionable Independent Task Scheduling
Problem"
K. P. Belkhale and P. Banerjee
in Proceedings of the 19th International Conference on Parallel Processing,
pages 72-75, St. Charles, IL, August 1990.

The problem of scheduling a collection of tasks on a multiprocessor
consisting of p processors so as to minimize the maximum completion time
has attracted a lot of attention in the literature. In this paper, we
introduce a new problem of scheduling a collection of independent tasks on
a multiprocessor, called the partitionable independent task scheduling
problem. Associated with each task, we are given the time it takes to run
on a uniprocessor and the speedup that can be obtained by running it on i
processors, 1 <= i <= p. We present an approximate algorithm that
guarantees a solution within 2/(1 + 1/p) of the optimal solution, under a
reasonable assumption on the speedup functions.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ipps91.bb.ps
%
"A Scheduling Algorithm for Parallelizable Dependent Tasks"
K. P. Belkhale and P. Banerjee
in Proceedings of the International Parallel Processing Symposium, pages
500-506, Anaheim, CA, April 1991.

The problem of scheduling a collection of tasks on a multiprocessor
consisting of p processors so as to minimize the maximum completion time
has attracted a lot of attention in the literature. In this paper, we
introduce a new problem of scheduling a task graph on a multiprocessor,
called the parallelizable dependent task scheduling problem. Associated
with each task, we are given the time it takes to run on a uniprocessor and
the speedup that can be obtained by running it on i processors,
1 <= i <= p. We present an algorithm for the problem and analyze its
performance.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% dmcc91.gb.ps
%
"Automatic Data Partitioning on Distributed Memory Multiprocessors"
Manish Gupta and Prithviraj Banerjee
Sixth Distributed Memory Computing Conference, Portland, OR, April 1991.

An important problem facing numerous research projects on parallelizing
compilers for distributed memory machines is that of automatically
determining a suitable data partitioning scheme for a program. Most of the
current projects leave this tedious problem almost entirely to the user. In
this paper, we present a novel approach to the problem of automatic data
partitioning. We introduce the notion of constraints on data distribution,
and show how, based on performance considerations, a compiler identifies
constraints to be imposed on the distribution of various data structures.
These constraints are then combined by the compiler to obtain a complete
and consistent picture of the data distribution scheme, one that offers
good performance in terms of the overall execution time.
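
A hypothetical sketch of the constraint-combination step in Python (the
tuple format, the weights, and the winner-takes-all resolution rule are
assumptions for illustration, not the paper's actual representation): each
reference pattern proposes a constraint on an array's distribution together
with a quality measure, and conflicting proposals are resolved in favour of
the larger combined weight.

    from collections import defaultdict

    def combine(constraints):
        """constraints: (array, proposed distribution, quality measure)."""
        weight = defaultdict(float)
        for array, proposal, quality in constraints:
            weight[(array, proposal)] += quality   # merge agreeing constraints
        scheme = {}
        for (array, proposal), w in weight.items():
            if array not in scheme or w > scheme[array][1]:
                scheme[array] = (proposal, w)      # heavier proposal wins
        return {a: p for a, (p, _) in scheme.items()}

    # Two conflicting proposals for A; the better-supported one is kept.
    print(combine([("A", "BLOCK on dim 1", 2.5),
                   ("A", "BLOCK on dim 2", 0.4),
                   ("A", "BLOCK on dim 1", 1.2)]))  # {'A': 'BLOCK on dim 1'}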

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% crhc9116.gb.ps
%
"Compile-Time Estimation of Communication Costs in Multicomputers"
Manish Gupta and Prithviraj Banerjee
Technical Report CRHC-91-16/UILU-ENG-91-2226, Center for Reliable and
High-Performance Computing, University of Illinois, Urbana, IL, May 1991
(revised, August 1992).

An important problem facing numerous research projects on parallelizing
compilers for distributed memory machines is that of automatically
determining a suitable data partitioning scheme for a program. Any strategy
for automatic data partitioning needs a mechanism for estimating the
performance of a program under a given partitioning scheme, the most
crucial part of which involves determining the communication costs incurred
by the program. In this paper, we describe a methodology for statically
estimating the communication times as certain functions of the array sizes
and the numbers of processors over which various arrays are distributed. We
present an algorithm to detect when communication may be taken out of a
loop, and also identify places in the program where transformations such as
loop distribution and loop interchange should be performed to enable
combining of messages. For certain loops with constant dependences, the
compiler can detect the possibility of pipelining, and take the overlapping
into account to get more accurate cost estimates. These results are of
great significance to any parallelization system supporting numeric
applications on multicomputers. In particular, they lay down a framework
for effective synthesis of communication on multicomputers from sequential
program references.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ipps92.gb.ps
%
"Compile-Time Estimation of Communication Costs on Multicomputers"
Manish Gupta and Prithviraj Banerjee
in Proceedings of the 6th International Parallel Processing Symposium,
pages 470-475, Beverly Hills, CA, March 1992.

An important problem facing numerous research projects on parallelizing
compilers for distributed memory machines is that of automatically
determining a suitable data partitioning scheme for a program. Any strategy
for automatic data partitioning needs a mechanism for estimating the
performance of a program under a given partitioning scheme, the most
crucial part of which involves determining the communication costs incurred
by the program. In this paper, we describe a methodology for statically
estimating the communication times as certain functions of the array sizes
and the numbers of processors over which various arrays are distributed.
This work also lays down a framework for effective synthesis of
communication on multicomputers from sequential program references.
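
A small Python illustration of what such static estimates look like (the
machine parameters ts and tw and the cost formulas are assumptions chosen
for this sketch, not figures from the paper): each communication pattern is
costed as a function of the message size and the number of processors.

    import math

    def transfer(m, ts=100e-6, tw=0.5e-6):
        """Time for one message of m elements: start-up plus per-element."""
        return ts + tw * m

    def boundary_exchange(n, p):
        """n x n array, rows block-distributed over p processors: every
        processor swaps one boundary row of n elements with a neighbour."""
        return transfer(n)

    def broadcast(m, p):
        """Tree-structured one-to-all broadcast: ceil(log2 p) message steps."""
        return math.ceil(math.log2(p)) * transfer(m)

    # Estimated per-sweep costs for a 512 x 512 stencil on 16 processors:
    print(boundary_exchange(512, 16), broadcast(512, 16))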

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% tpds92.gb.ps
%
"Demonstration of Automatic Data Partitioning Techniques for Parallelizing
Compilers on Multicomputers"
Manish Gupta and Prithviraj Banerjee
IEEE Transactions on Parallel and Distributed Systems, 3(2):179-193, March
1992.

An important problem facing numerous research projects on parallelizing
compilers for distributed memory machines is that of automatically
determining a suitable data partitioning scheme for a program. Most of the
current projects leave this tedious problem almost entirely to the user. In
this paper, we present a novel approach to the problem of automatic data
partitioning. We introduce the notion of constraints on data distribution,
and show how, based on performance considerations, a compiler identifies
constraints to be imposed on the distribution of various data structures.
These constraints are then combined by the compiler to obtain a complete
and consistent picture of the data distribution scheme, one that offers
good performance in terms of the overall execution time. We present results
of a study we performed on Fortran programs taken from the Linpack and
Eispack libraries and the Perfect Benchmarks to determine the applicability
of our approach to real programs. The results are very encouraging, and
demonstrate the feasibility of automatic data partitioning for programs
with regular computations that may be statically analyzed, which covers an
extremely significant class of scientific application programs.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ics92.gb.ps
%
"A Methodology for High-Level Synthesis of Communication on Multicomputers"
Manish Gupta and Prithviraj Banerjee
in Proceedings of the Sixth ACM International Conference on Supercomputing,
pages 357-367, Washington D.C., July 1992.

Freeing the user from the tedious task of generating explicit communication
is one of the primary goals of numerous research projects on compilers for
distributed memory machines. In the process of synthesis of communication,
the effective use of collective communication routines offers a
considerable scope for improving the program performance. This paper
presents a methodology for determining the collective communication
primitives that should be used for implementing the data movement at
various points in the program. We introduce the notion of certain
synchronous properties between array references in statements inside loops,
and present tests to determine the presence of these properties. These
tests enable the compiler to analyze quite precisely the communication
requirements of those statements, and implement communication using
appropriate primitives. These results not only lay down a framework for
synthesis of communication on multicomputers, they also form the basis of
our implementation of a system that statically estimates the communication
costs of programs on multicomputers.
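
A much-simplified Python sketch of how a compiler can pick a collective
primitive from subscript patterns (the two-coefficient subscript encoding
and the primitive names are assumptions for illustration; the paper's
synchronous-property tests are considerably more general):

    def classify(lhs, rhs):
        """Subscripts encoded as (a, b) for a*i + b in a loop over i;
        lhs is the written reference, rhs the read reference."""
        (al, bl), (ar, br) = lhs, rhs
        if (al, bl) == (ar, br):
            return "local access, no communication"
        if al == ar != 0:
            return f"Shift by {br - bl}"         # e.g. A(i) = B(i-1)
        if ar == 0:
            return f"Broadcast of element {br}"  # e.g. A(i) = B(3)
        return "general Gather"

    print(classify((1, 0), (1, -1)))  # Shift by -1
    print(classify((1, 0), (0, 3)))   # Broadcast of element 3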

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% phd92.g.ps
%
"Automatic Data Partitioning on Distributed Memory Multicomputers"
Manish Gupta
Ph.D. Dissertation, University of Illinois at Urbana-Champaign
PhD thesis, Department of Computer Science, University of Illinois, Urbana,
IL, September 1992, CRHC-92-19/UILU-ENG-92-2237.
Center for Reliable and High-Performance Computing
Advisor: Prithviraj Banerjee

Distributed-memory parallel computers are increasingly being used to
provide high levels of performance for scientific applications.
Unfortunately, such machines are not very easy to program. A number of
research efforts seek to alleviate this problem by developing compilers
that take over the task of generating communication. The communication
overheads and the extent of parallelism exploited in the resulting target
program are determined largely by the manner in which data is partitioned
across different processors of the machine. Most of the compilers provide
no assistance to the programmer in the crucial task of determining a good
data partitioning scheme. This thesis presents a novel approach, the
constraint-based approach, to the problem of automatic data partitioning
for numeric programs. In this approach, the compiler identifies some
desirable requirements on the distribution of various arrays being
referenced in each statement, based on performance considerations. These
desirable requirements are referred to as constraints. For each constraint,
the compiler determines a quality measure that captures its importance with
respect to the performance of the program. The quality measure is obtained
through static performance estimation, without actually generating the
target data-parallel program with explicit communication. Each data
distribution decision is taken by combining all the relevant constraints.
The compiler attempts to resolve any conflicts between constraints such
that the overall execution time of the parallel program is minimized. This
approach has been implemented as part of a compiler called PARADIGM, which
accepts Fortran 77 programs and specifies the partitioning scheme to be
used for each array in the program. We have obtained results on some
programs taken from the Linpack and Eispack libraries, and the Perfect
Benchmarks. These results are quite promising, and demonstrate the
feasibility of automatic data partitioning for a significant class of
scientific application programs with regular computations.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ics93.gb.ps
%
"PARADIGM: A Compiler for Automatic Data Distribution on Multicomputers"
Manish Gupta and Prithviraj Banerjee
in Proceedings of the 7th ACM International Conference on Supercomputing,
Tokyo, Japan, July 1993.

One of the most challenging steps in developing a parallel program for a
distributed memory machine is determining how data should be distributed
across processors. Most of the compilers being developed to make it easier
to program such machines still provide no assistance to the programmer in
this difficult and machine-dependent task. We have developed PARADIGM, a
compiler that makes data partitioning decisions for Fortran 77 procedures.
A significant feature of the design of PARADIGM is the decomposition of the
data partitioning problem into a number of sub-problems, each dealing with
a different distribution parameter for all the arrays. This paper presents
the algorithms that, in conjunction with the computational and
communication cost estimators we have developed, determine those
distribution parameters. We also present results obtained on Fortran
procedures taken from the Linpack and Eispack libraries, and the Perfect
Benchmarks. We believe these are the first results demonstrating the
success of automatic data partitioning on a significant class of Fortran
procedures.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% jopl94.gb.ps
%
"Compile-Time Estimation of Communication Costs of Programs"
Manish Gupta and Prithviraj Banerjee
to appear in the Journal of Programming Languages, 1994.

One of the most challenging problems in compiling for distributed memory
machines is to determine how data for a program should be distributed
across processors. Any compiler that makes data partitioning decisions
needs a mechanism for estimating the communication and computational costs
of programs to compare different alternatives. This paper presents a
methodology for estimating the communication costs of programs written in a
global address space. In this approach, the compiler analyzes programs
before generating communication, and yet takes into account important
communication optimizations that will be performed. We introduce the notion
of traversal properties of array references in loops, which help identify
the nature and extent of data movement in terms of high-level communication
primitives. This enables the compiler to obtain more precise information
about the global state of communication, in a largely machine-independent
manner.
The methodology described in this paper has been implemented in a compiler,
PARADIGM, that automatically determines data partitioning for Fortran
programs. The results obtained with PARADIGM confirm the importance of this
analysis for making good data partitioning decisions. The techniques
developed in this work for recognizing communication primitives that best
characterize the data movement are quite general, and also form the basis
of communication generation in PTRAN-II, a prototype compiler for High
Performance Fortran.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ms95.h.ps
%
"High Performance Fortran Support for the PARADIGM Compiler"
Eugene W. Hodges IV
Master's thesis, Department of Electrical Engineering, University of
Illinois, Urbana, IL, October 1995.
Center for Reliable and High-Performance Computing
Advisor: Prithviraj Banerjee

Lacking a global address space, distributed-memory multicomputers present a
difficult programming model. High Performance Fortran (HPF) provides a
portable, efficient environment for the programmer writing data-parallel
applications. Research in the PARADIGM compiler project seeks to alleviate
the programmer's burden of writing explicitly parallel applications using
non-portable libraries and language extensions by automatically compiling
sequential Fortran programs into efficient SPMD message-passing programs.
Because the technology to automatically partition data and detect all
parallelism in a sequential program is not mature, HPF provides a standard
interface for supplying additional information to the compiler. The
information expressed in an HPF program is thus complementary to the
PARADIGM compiler project: HPF allows the programmer to assist the compiler
with parts of the compilation process that may eventually be automated, as
well as to provide information that is impossible to determine at compile
time or efficiently at run time. This thesis presents an overview of HPF,
the implementation of HPF in the PARADIGM compiler, and an evaluation of
the compiler using several benchmarks. The results demonstrate the
effectiveness of the compiler in generating efficient SPMD Fortran code
from the HPF code.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% shpcc94.hlb.ps
%
"Compilation of Scientific Programs into Multithreaded and Message Driven
Computation"
John Holm, Antonio Lain, and Prithviraj Banerjee
in Proceedings of the 1994 Scalable High Performance Computing Conference,
pages 518-525, Knoxville, TN, May 1994.

Many programs written in the SPMD programming model send messages
asynchronously, and block when receiving messages. Multiple threads can
make use of the processor while other threads wait for messages. This paper
describes and evaluates two techniques for multithreading on the nodes of
distributed memory message passing systems. One method is a purely runtime
threads package. The second method requires the SPMD code to be
systematically transformed into message-driven code that can be run under a
message-driven model. The multithreading of scientific applications is
evaluated on the iPSC/2 and the CM-5.
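
A toy Python sketch of the latency-hiding idea (threads and queues standing
in for processor nodes and messages are assumptions of this sketch; the
systems in the paper are a runtime threads package and a message-driven
code transformation):

    import threading, queue, time

    inbox = queue.Queue()

    def waiting_thread():
        msg = inbox.get()          # blocking receive: this thread sleeps
        print("received:", msg)

    def network():
        time.sleep(0.1)            # simulated message latency
        inbox.put("boundary data")

    t = threading.Thread(target=waiting_thread)
    n = threading.Thread(target=network)
    t.start(); n.start()
    print("main thread keeps computing while the receive is pending")
    t.join(); n.join()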

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ics94.lb.ps
%
"Techniques to Overlap Computation and Communication in Irregular Iterative
Applications"
A. Lain and P. Banerjee
in Proceedings of the 8th ACM International Conference on Supercomputing,
pages 236-245, Manchester, England, July 1994.

There are many applications in CFD and structural analysis that can be more
accurately modeled using unstructured grids. Parallelization of implicit
methods for unstructured grids is a difficult and important problem. This
paper deals with coloring techniques to overlap computation and
communication during the solution of implicit methods on message-passing
distributed-memory multicomputers. An evaluation of coloring techniques for
partitioned unstructured grids is first presented. Results show the
importance of using partitioning information during coloring. It is next
shown that overlapping computation and communication can be formalized as a
generalized coloring problem. Modified coloring algorithms are used for
this purpose. The PARTI library has been extended to support non-blocking
gather-scatter operations and used in conjunction with these algorithms.
Practicality issues are evaluated with experimental results on an Intel
Paragon multicomputer.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ipps95.lb.ps
%
"Exploiting Spatial Regularity in Irregular Iterative Applications"
A. Lain and P. Banerjee
in Proceedings of the 9th International Parallel Processing Symposium,
pages 820-827, Santa Barbara, CA, April 1995.

The increasing gap between the speed of microprocessors and memory
subsystems makes it imperative to exploit locality of reference in
sequential irregular applications. The parallelization of such applications
requires special consideration. Current run-time support (RTS) for
irregular computations fails to exploit the fine-grain regularity present
in these applications, incurring unnecessary time and memory overheads.
PILAR (Parallel Irregular Library with Application of Regularity) is a new
RTS for irregular computations that provides a variety of internal
representations of communication patterns based on their regularity,
allowing for the efficient support of a wide spectrum of regularity under a
common framework. Experimental results on the IBM SP-1 and the Intel
Paragon demonstrate the validity of our approach.
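
A minimal Python sketch of the kind of regularity PILAR can exploit (the
interval encoding below illustrates the general idea and is not PILAR's
actual internal representation): indirection arrays in irregular codes
often contain long consecutive runs, which compress to a few
(start, length) intervals, cutting both preprocessing time and memory.

    def to_intervals(indices):
        """Collapse an indirection array into (start, length) runs."""
        runs, start, length = [], indices[0], 1
        for a, b in zip(indices, indices[1:]):
            if b == a + 1:
                length += 1                   # extend the current run
            else:
                runs.append((start, length))  # close it and start a new one
                start, length = b, 1
        runs.append((start, length))
        return runs

    # A mostly-regular access pattern collapses to three intervals:
    print(to_intervals([4, 5, 6, 7, 20, 21, 22, 9]))
    # -> [(4, 4), (20, 3), (9, 1)]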

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% phd95.l.ps
%
"Compiler and Run-Time Support for Irregular Computations"
A. Lain
PhD thesis, Department of Computer Science, University of Illinois, Urbana,
IL, October 1995, CRHC-92-22/UILU-ENG-95-2236.
Center for Reliable and High-Performance Computing
Advisor: Prithviraj Banerjee

There are many important applications in computational fluid dynamics,
circuit simulation and structural analysis that can be more accurately
modeled using iterations on unstructured grids. In these problems, regular
compiler analysis for Massively Parallel Processors (MPP) with distributed
address space fails because communication can only be determined at
run-time. However, in many of these applications the communication pattern
repeats for every iteration. Therefore, optimizations equivalent to those
in the regular case can be achieved with a combination of run-time support
(RTS) and compiler analysis. Moreover, many real applications have
irregular access patterns with certain structure. This structure is needed
to improve sequential performance by better exploiting the memory
hierarchy. In many cases this structure is reflected in the array subscript
expressions in the form of hybrid accesses, that is, affine combinations of
array indirection and regular accesses. We will show that by taking
advantage of this structure, we can dramatically reduce pre-processing
overheads, both in terms of memory and time, and improve communication and
single-node performance. In order to achieve these goals we need a
combination of run-time and compiler support. Unfortunately, existing
irregular run-time support cannot take advantage of regularity; therefore,
our first requirement was to develop a new run-time support for irregular
computations, which we call PILAR (Parallel Irregular Library with
Application of Regularity). Our second requirement was to automate the task
of exploiting regularity in irregular applications through compiler
support. For this task we extended the PARADIGM compiler, developed at the
University of Illinois, to handle irregular problems. Experimental results
validate the efficacy of our approach.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% icpp94.pscb.ps
%
"Communication Optimizations used in the PARADIGM Compiler for
Distributed-Memory Multicomputers"
Daniel J. Palermo, Ernesto Su, John A. Chandy, and Prithviraj Banerjee
in Proceedings of the 23rd International Conference on Parallel Processing,
pages II:1-10, St. Charles, IL, August 1994.

The PARADIGM (PARAllelizing compiler for DIstributed-memory General-purpose
Multicomputers) project at the University of Illinois provides a fully
automated means to parallelize programs, written in a serial programming
model, for execution on distributed-memory multicomputers. To provide
efficient execution, PARADIGM automatically performs various optimizations
to reduce the overhead and idle time caused by interprocessor
communication. Optimizations studied in this paper include message
coalescing, message vectorization, message aggregation, and coarse-grain
pipelining. To separate the optimization algorithms from machine-specific
details, parameterized models are used to estimate communication and
computation costs for a given machine. The models are also used in
coarse-grain pipelining to automatically select a task granularity that
balances the available parallelism with the costs of communication. To
determine the applicability of the optimizations on different machines, we
analyzed their performance on an Intel iPSC/860, an Intel iPSC/2, and a
Thinking Machines CM-5.
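
A back-of-the-envelope Python sketch of why message vectorization pays off
(ts and tw are assumed machine parameters, not measurements from the
paper): hoisting communication out of a loop replaces n one-element
messages with a single n-element message, so the start-up cost is paid once
instead of n times.

    def per_iteration_cost(n, ts=100e-6, tw=0.5e-6):
        return n * (ts + tw)        # n messages of one element each

    def vectorized_cost(n, ts=100e-6, tw=0.5e-6):
        return ts + tw * n          # one message of n elements

    n = 1000
    print(per_iteration_cost(n) / vectorized_cost(n))  # roughly 167x cheaper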

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% crhc9509.pb.ps
%
"Automatic Selection of Dynamic Data Partitioning Schemes for
Distributed-Memory Multicomputers"
Daniel J. Palermo and Prithviraj Banerjee
Technical Report CRHC-95-09/UILU-ENG-95-2210, Center for Reliable and
High-Performance Computing, University of Illinois, Urbana, IL, April 1995.

For distributed memory multicomputers such as the Intel Paragon, the IBM
SP-2, the NCUBE/2, and the Thinking Machines CM-5, the quality of the data
partitioning for a given application is crucial to obtaining high
performance. This task has traditionally been the user's responsibility,
but in recent years much effort has been directed to automating the
selection of data partitioning schemes. Several researchers have proposed
systems that are able to produce data distributions that remain in effect
for the entire execution of an application. For complex programs, however,
such static data distributions may be insufficient to obtain acceptable
performance. The selection of distributions that dynamically change over
the course of a program's execution adds another dimension to the data
partitioning problem. In this paper, we present a technique that can be
used to automatically determine which partitionings are most beneficial
over specific sections of a program while taking into account the added
overhead of performing redistribution. This system is being built as part
of the PARADIGM (PARAllelizing compiler for DIstributed memory
General-purpose Multicomputers) project at the University of Illinois. The
complete system will provide a fully automated means to parallelize
programs written in a serial programming model obtaining high performance
on a wide range of distributed-memory multicomputers.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% lcpc95.pb.ps
%
"Automatic Selection of Dynamic Data Partitioning Schemes for
Distributed-Memory Multicomputers"
Daniel J. Palermo and Prithviraj Banerjee
in the Proceedings of the 8th Workshop on Languages and Compilers for
Parallel Computing, pages 392-406, Columbus, OH, August 1995.

For distributed-memory multicomputers such as the Intel Paragon, the IBM
SP-1/SP-2, the NCUBE/2, and the Thinking Machines CM-5, the quality of the
data partitioning for a given application is crucial to obtaining high
performance. This task has traditionally been the user's responsibility,
but in recent years much effort has been directed to automating the
selection of data partitioning schemes. Several researchers have proposed
systems that are able to produce data distributions that remain in effect
for the entire execution of an application. For complex programs, however,
such static data distributions may be insufficient to obtain acceptable
performance. The selection of distributions that dynamically change over
the course of a program's execution adds another dimension to the data
partitioning problem. In this paper, we present a technique that can be
used to automatically determine which partitionings are most beneficial
over specific sections of a program while taking into account the added
overhead of performing redistribution. This system is being built as part
of the PARADIGM (PARAllelizing compiler for DIstributed memory
General-purpose Multicomputers) project at the University of Illinois. The
complete system will provide a fully automated means to parallelize
programs written in a serial programming model obtaining high performance
on a wide range of distributed-memory multicomputers.
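
A hypothetical Python sketch of the static-versus-dynamic decision (the
single lumped redistribution cost and the example numbers are assumptions
for illustration): a dynamic scheme wins only when its per-phase savings
exceed the redistribution overhead.

    def best_scheme(phase_costs, redistribution_cost):
        """phase_costs: {scheme name: [estimated cost of each phase]}."""
        best_static = min(sum(c) for c in phase_costs.values())
        n_phases = len(next(iter(phase_costs.values())))
        dynamic = sum(min(c[i] for c in phase_costs.values())
                      for i in range(n_phases)) + redistribution_cost
        return (("dynamic", dynamic) if dynamic < best_static
                else ("static", best_static))

    # Phase 1 favours a row distribution, phase 2 a column distribution:
    print(best_scheme({"rows": [1.0, 9.0], "cols": [9.0, 1.0]}, 2.0))
    # -> ('dynamic', 4.0): 1.0 + 1.0 + 2.0 beats the best static cost 10.0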

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% icpp96.pshb.ps
%
"Compiler Support for Privatization on Distributed-Memory Machines"
D. J. Palermo, E. Su, E. W. Hodges IV, and P. Banerjee
to appear in the Proceedings of the 25th International Conference on
Parallel Processing, Bloomingdale, IL, August 1996.

The practice of using temporary scalar or array variables to store the
results of common subexpressions presents several challenges to a
parallelizing compiler. Not only do dependence analysis and, as a result,
parallelization suffer, but existing techniques used for partitioning
programs and generating communication for parallel execution on
distributed-memory multicomputers also tend to break down. Techniques that
have been developed over the years to compensate for this programming
practice include scalar expansion, global forward substitution, and
privatization, each of which has its own strengths and weaknesses. Compared
to scalar expansion and global forward substitution, privatization has the
advantage of not causing an increase in memory requirements or operation
counts, but when compiling for distributed-memory machines it causes
several new problems to arise. In this paper, we present a simple extension
to a uniform array-region analysis framework that utilizes privatization
information to partition loops and generate efficient communication, using
the owner-computes rule, in the presence of temporary variables.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% phd96.p.ps
%
"Compiler Techniques for Optimizing Communication and Data Distribution for
Distributed-Memory Multicomputers"
Daniel J. Palermo
PhD thesis, Department of Electrical and Computer Engineering, University
of Illinois at Urbana-Champaign, Urbana, IL, June 1996,
CRHC-96-09/UILU-ENG-96-2215.
Center for Reliable and High-Performance Computing
Advisor: Prithviraj Banerjee

Distributed-memory multicomputers, such as the Intel iPSC/860, the Intel
Paragon, the IBM SP-1/SP-2, the NCUBE/2, and the Thinking Machines CM-5,
offer significant advantages over shared-memory multiprocessors in terms of
cost and scalability. However, lacking a global address space, they present
a very difficult programming model in which the user must specify how data
and computation are to be distributed across processors and determine which
sections of data must be communicated to specific processors. To overcome
this difficulty, significant research effort has been aimed at
source-to-source parallelizing compilers for multicomputers that relieve
the programmer of the tasks of program partitioning and communication
generation, while the specification of data distributions has remained a
responsibility of the programmer. The quality of the data distribution for
a given application is crucial to obtaining high performance on
distributed-memory multicomputers. By selecting an appropriate data
distribution, the amount of communication required by an application can be
dramatically reduced. The resulting performance using a given data
distribution therefore depends on how well the compiler can optimize the
remaining communication. In this thesis, we present and analyze several
techniques to optimize communication and, based on the performance of these
optimizations, automatically select the best data partitioning for a given
application. Previous work in the area of optimizing data distribution used
constraints based on performance estimates (which model the communication
optimizations) to select high-quality data distributions that remain in
effect for the entire execution of an application. For complex programs,
however, such static data distributions may be insufficient to obtain
acceptable performance. The selection of distributions that dynamically
change over the course of a program's execution (taking into account the
added overhead of performing redistribution) adds another dimension to the
data partitioning problem. In this thesis, we present a technique that
extends the static data partitioning algorithm to automatically determine
those distributions most beneficial over specific sections of a program
while taking into account the added overhead of performing redistribution.
Finally, we also present an interprocedural data-flow framework that
performs the conversion between a distribution-based representation (as
specified by the data partitioner) and a redistribution-based form (as
specified by HPF) while optimizing the amount of redistribution that must
be performed. These techniques have been implemented as part of the
PARADIGM (PARAllelizing compiler for DIstributed memory General-purpose
Multicomputers) project at the University of Illinois. The complete system
strives to provide a fully automated means to parallelize sequential
programs obtaining high performance on a wide range of distributed-memory
multicomputers.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% icpp93.rb.ps
%
"Processor Allocation and Scheduling of Macro Dataflow Graphs on
Distributed Memory Multicomputers by the PARADIGM Compiler"
Shankar Ramaswamy and Prithviraj Banerjee
in Proceedings of the 22nd International Conference on Parallel Processing,
pages II:134-138, St. Charles, IL, August 1993.

Functional or control parallelism is an effective way to increase speedups
on multicomputers. Programs for these machines are represented by Macro
Dataflow Graphs (MDGs) for the purpose of functional parallelism analysis
and exploitation. We discuss algorithms for allocation and scheduling of
MDGs, along with some analysis of their optimality. These algorithms
attempt to minimize the execution time of any given MDG through
exploitation of functional parallelism. Our preliminary results show their
effectiveness over naive algorithms.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% crhc9409.rb.ps
%
"Automatic Generation of Efficient Array Redistribution Routines for
Distributed Memory Multicomputers"
Shankar Ramaswamy and Prithviraj Banerjee
Technical Report CRHC-94-09/UILU-ENG-94-2213, Center for Reliable and
High-Performance Computing, University of Illinois, Urbana, IL, June 1994.

Appropriate data distribution has been found to be critical for obtaining
good performance on Distributed Memory Multicomputers like the CM-5, Intel
Paragon and IBM SP-1. It has also been found that some programs need to
change their distributions during execution for better performance
(redistribution). This work focuses on automatically generating efficient
routines for redistribution. We present a new mathematical representation
for regular distributions called PITFALLS and then discuss algorithms for
redistribution based on this representation. One significant contribution
of this work is the ability to handle arbitrary source and target processor
sets while performing redistribution. Another important contribution is the
ability to handle an arbitrary number of dimensions for the array involved
in the redistribution in a scalable manner. Our implementation of these
techniques is based on an MPI-like communication library. The results
presented show the low overheads of our redistribution algorithm as
compared to naive runtime methods.
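
An illustrative Python sketch of the redistribution problem that the
PITFALLS representation solves symbolically (the element-by-element set
enumeration below is deliberately naive, which is exactly what the paper's
line-segment algebra avoids): intersecting source and target ownership sets
yields what each processor pair must exchange.

    def owned(n, p, nprocs, block):
        """Global indices of array[0:n] owned by processor p under a
        block-cyclic distribution with the given block size."""
        return {i for i in range(n) if (i // block) % nprocs == p}

    n, P = 16, 2
    # Redistribute from BLOCK (block size n // P) to CYCLIC (block size 1):
    for src in range(P):
        for dst in range(P):
            moves = owned(n, src, P, n // P) & owned(n, dst, P, 1)
            print(f"P{src} -> P{dst}: {sorted(moves)}")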

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% crhc9410.rsb.ps
%
"A Framework for Exploiting Data and Functional Parallelism on Distributed
Memory Multicomputers"
Shankar Ramaswamy, Sachin Sapatnekar, and Prithviraj Banerjee
Technical Report CRHC-94-10/UILU-ENG-94-2224, Center for Reliable and
High-Performance Computing, University of Illinois, Urbana, IL, June 1994.

Recent research efforts have shown the benefits of integrating functional
and data parallelism over using either pure data parallelism or pure
functional parallelism. This paper presents a theoretical framework for
deciding on a good execution strategy for a given program based on the
available functional and data parallelism in the program. The framework is
based on assumptions about the form of the computation and communication
cost functions for multicomputer systems. We present mathematical functions
for these costs and show that these functions are realistic. The framework
also requires specification of the available functional and data
parallelism for a given problem. For this purpose, we have developed a
graphical programming tool. We have tested our approach using three
benchmark programs on the Thinking Machines CM-5 and the Intel Paragon. The
results presented show that the approach is very effective and can provide
a two- to three-fold increase in speedups over approaches using only data
parallelism.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% icpp94.rsb.ps
%
"A Convex Programming Approach for Exploiting Data and Functional
Parallelism on Distributed Memory Multicomputers"
Shankar Ramaswamy, Sachin Sapatnekar, and Prithviraj Banerjee
in Proceedings of the 23rd International Conference on Parallel Processing,
pages II:116-125, St. Charles, IL, August 1994.

In the past, compilers have focused on the exploitation of either
functional or data parallelism. The PARADIGM compiler project at the
University of Illinois is among the first to incorporate techniques for the
simultaneous exploitation of both. This paper describes the techniques used
in the PARADIGM compiler and analyzes their optimality. It is the first of
its kind to use realistic cost models and to include data transfer costs,
which all previous researchers have neglected. Preliminary results on the
CM-5 show the efficacy of our methods and the significant advantages of
using functional and data parallelism together for the execution of real
applications.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% fmpc95.rb.ps
%
"Automatic Generation of Efficient Array Redistribution Routines for
Distributed Memory Multicomputers"
Shankar Ramaswamy and Prithviraj Banerjee
in Proceedings of Frontiers '95: The Fifth Symposium on the Frontiers of
Massively Parallel Computation, pages 342-349, McLean, VA, February 1995.

Appropriate data distribution has been found to be critical for obtaining
good performance on Distributed Memory Multicomputers like the CM-5, Intel
Paragon and IBM SP-1. It has also been found that some programs need to
change their distributions during execution for better performance
(redistribution). This work focuses on automatically generating efficient
routines for redistribution. We present a new mathematical representation
for regular distributions called PITFALLS and then discuss algorithms for
redistribution based on this representation. A significant contribution of
this work is the ability to handle arbitrary source and target processor
sets while performing redistribution; another is the ability to handle
arbitrary dimensionality for the array being redistributed in a scalable
manner. The results presented show low overheads for our redistribution
algorithm as compared to naive runtime methods.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ppl95.rb.ps
%
"Simultaneous Allocation and Scheduling Using Convex Programming Techniques"
S. Ramaswamy and P. Banerjee
in Parallel Processing Letters, December 1995.

Simultaneous exploitation of task and data parallelism provides significant
benefits for many applications. The basic approach for exploiting task and
data parallelism is to use a task graph representation (Macro Dataflow
Graph) for programs to decide on the degree of data parallelism to be used
for each task (allocation) and an execution order for the tasks
(scheduling). Previously, we presented a two-step approach for allocation
and scheduling by considering the two steps to be independent of each
other. In this paper, we present a new simultaneous approach which uses
constraints to model the scheduler during allocation. The new simultaneous
approach provides significant benefits over our earlier approach for the
benchmark task graphs that we have considered.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% phd96.r.ps
%
"Simultaneous Exploitation of Task and Data Parallelism in Regular
Scientific Applications"
Shankar Ramaswamy
PhD thesis, Department of Electrical and Computer Engineering, University
of Illinois at Urbana-Champaign, Urbana, IL, January 1996.
Center for Reliable and High-Performance Computing
Advisor: Prithviraj Banerjee

Distributed Memory Multicomputers (DMMs) such as the IBM SP-2, the Intel
Paragon and the Thinking Machines CM-5 offer significant advantages over
shared memory multiprocessors in terms of cost and scalability.
Unfortunately, the utilization of all the available computational power in
these machines involves a tremendous programming effort on the part of
users, which creates a need for sophisticated compiler and run-time support
for distributed memory machines. In this thesis we explore a new compiler
optimization for regular scientific applications - the simultaneous
exploitation of task and data parallelism. Our optimization is implemented
as part of the PARADIGM HPF compiler framework and as part of a MATLAB
compiler framework we have developed. The intuitive idea behind the
optimization is the use of task parallelism to control the degree of data
parallelism of individual tasks. The reason this provides increased
performance is that data parallelism provides diminishing returns as the
number of processors used is increased. By controlling the number of
processors used for each data parallel task in an application and by
concurrently executing these tasks, we make program execution more
efficient and therefore faster. A practical implementation of a task and
data parallel scheme of execution for an application on a distributed
memory multicomputer also involves data redistribution. This data
redistribution causes an overhead. However, as our experimental results
show, this overhead is not a problem; execution of a program using task and
data parallelism together can be significantly faster than its execution
using data parallelism alone. This makes our proposed optimization
practical and extremely useful.
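
An illustrative Python sketch of the diminishing-returns argument (the cost
model t = w/p + overhead*p and all constants are assumptions chosen for
this sketch, not the thesis's measured models):

    def t(work, p, overhead=0.05):
        """Data-parallel task time: computation shrinks with p,
        communication grows with p."""
        return work / p + overhead * p

    P = 16
    tasks = [10.0, 10.0]                # two equal, independent tasks
    data_only = sum(t(w, P) for w in tasks)           # in sequence on all P
    task_and_data = max(t(w, P // 2) for w in tasks)  # side by side, P/2 each
    print(data_only, task_and_data)     # 2.85 vs 1.65: concurrent wins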

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ipps96.rhb.ps
%
"Compiling Matlab Programs to ScaLAPACK: Exploiting Task and Data
Parallelism"
Shankar Ramaswamy, Eugene W. Hodges IV, and Prithviraj Banerjee
in Proceedings of the International Parallel Processing Symposium,
Honolulu, HI, April 1996.

In this paper we suggest a new approach aimed at reducing the effort
required to program distributed-memory multicomputers. The key idea in our
approach is to automatically convert a program written in the library-based
programming language Matlab into a parallel program based on the ScaLAPACK
parallel library. In the process of performing this conversion, we apply
compiler optimizations that simultaneously exploit task and data
parallelism. As our results show, our approach is feasible and practical
and our optimization provides significant performance benefits.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% crhc9520.rb.ps
%
"Compiler Assisted Generation of Error Detecting Parallel Programs"
A. Roy-Chowdhury and P. Banerjee
Technical Report CRHC-95-20/UILU-ENG-95-2231, Center for Reliable and
High-Performance Computing, University of Illinois, Urbana, IL, August
1995.

We have developed an automated, compile-time approach to generating
error-detecting parallel programs. The compiler is used to identify
statements implementing affine transformations within the program and to
automatically insert code for computing, manipulating, and comparing
checksums in order to check the correctness of the code implementing affine
transformations. Statements which do not implement affine transformations
are checked by duplication. Checksums are reused from one loop to the next
where possible, rather than recomputing checksums for every statement. A
global dataflow analysis is performed in order to determine points at which
checksums need to be recomputed. We also use a novel method of specifying
the data distributions of the check data using directives provided by the
High Performance Fortran (HPF) standard so that the computations on the
original data and the corresponding check computations are performed on
different processors. Results are presented on an Intel Paragon distributed
memory multicomputer.

Keywords: Algorithm-Based Fault Tolerance, Checksum Encoding, Parallelizing
Compilers, Compiler Assisted Fault Tolerance.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% phd95.r.ps
%
"Manual and Compiler Assisted Methods for Generating Fault-Tolerant
Parallel Programs"
A. Roy-Chowdhury
PhD thesis, Department of Electrical and Computer Engineering, University
of Illinois at Urbana-Champaign, Urbana, IL, December 1995,
CRHC-95-27/UILU-ENG-95-2243.
Center for Reliable and High-Performance Computing
Advisor: Prithviraj Banerjee

Algorithm-based fault-tolerance (ABFT) is an inexpensive method of
incorporating fault-tolerance into existing applications. Applications are
modified to operate on encoded data and produce encoded results which may
then be checked for correctness. An attractive feature of the scheme is
that it requires little or no modification to the underlying hardware or
system software. Previous algorithm-based methods for developing reliable
versions of numerical programs for general-purpose multicomputers have
mostly concerned themselves with error detection. A truly fault-tolerant
algorithm, however, needs to locate errors and recover from them once they
have been located. In a parallel processing environment, this corresponds
to locating the faulty processors and recovering the data corrupted by the
faulty processors. In this dissertation, we first present a general scheme
for performing fault-location and recovery under the ABFT framework.
Our fault model assumes that a faulty processor can corrupt all of the data
it possesses. The fault-location scheme is an application of system-level
diagnosis theory to the ABFT framework, while the fault-recovery scheme
uses ideas from coding theory to maintain redundant data and uses it to
recover corrupted data in the event of processor failures. Results are
presented on implementations of three numerical algorithms on a distributed
memory multicomputer, which demonstrate acceptably low overheads for the
single- and double-fault location and recovery cases. For a class of
algorithms performing affine transformations, we automate the process of
generating an error-detecting version at compile time. The compiler is used
to identify loops that perform affine transformations on array elements.
These loops are then checked by computing a checksum over the array
elements being transformed and transforming the checksums appropriately,
which typically results in much smaller overheads than checking the entire
code by duplication. Portions of code in the program that are not affine
transformations are checked by duplication. An existing source-to-source
compiler, Parafrase-2, has been modified to take in programs written in
High Performance Fortran (HPF) and output an error-detecting version of the
same. Data distributions for the new arrays and checksums introduced are
specified by inserting additional HPF directives in the program. The
modified program can then be input to a parallelizer for distributed memory
machines, such as PARADIGM, to obtain an error-detecting parallel program.
We demonstrate results on three numerical programs by executing the
error-detecting versions generated by our compiler on a distributed memory
multicomputer.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ms93.s.ps
%
"Automating Parallelization of Regular Computations for Distributed-Memory
Multicomputers"
E. Su
Master's thesis, Department of Electrical and Computer Engineering,
University of Illinois at Urbana-Champaign, Urbana, IL, March 1993.
Center for Reliable and High-Performance Computing
Advisor: Prithviraj Banerjee

Distributed-memory multicomputers, such as the Intel iPSC/860, the NCUBE/2,
the Intel Paragon and the Connection Machine CM-5, offer significant
advantages over shared-memory multiprocessors in terms of cost and
scalability. Unfortunately, to extract all of that computational power from
these machines, users have to write efficient software for them, which is
an extremely laborious process. One major reason for this difficulty is the
absence of a single global shared address space. As a result, the
programmer has to distribute code and data across processors and manage
communication among tasks explicitly. Clearly there is a need for efficient
parallel programming support on these machines. The PARADIGM parallelizing
compiler project addresses that problem by developing an automated means of
compiling programs written in a serial programming model for efficient
execution on distributed-memory multicomputers. In this thesis we discuss
automatic parallelization of regular computations using symbolic sets as a
representation for data accesses and communication. Algorithms for
computation partitioning and communication generation are discussed.
Results on the speedup and efficiency of some SPMD node programs generated
by PARADIGM are studied to determine the applicability of our approach.
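
A minimal Python sketch of the owner-computes partitioning that such set
representations support (PARADIGM manipulates these sets symbolically; the
concrete numbers, the BLOCK-only distribution, and the function names here
are illustrative assumptions):

    def local_iterations(lo, hi, p, nprocs, n):
        """Iterations i in [lo, hi) whose left-hand side A[i] is owned by
        processor p, for A[0:n] distributed BLOCK over nprocs processors."""
        block = -(-n // nprocs)                       # ceil(n / nprocs)
        own_lo, own_hi = p * block, min((p + 1) * block, n)
        return range(max(lo, own_lo), min(hi, own_hi))

    # Loop 'for i in 1..98: A[i] = ...' with A[0:100] on 4 processors:
    for p in range(4):
        print(p, local_iterations(1, 99, p, 4, 100))
    # -> range(1, 25), range(25, 50), range(50, 75), range(75, 99)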

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% icpp93.spb.ps
%
"Automating Parallelization of Regular Computations for Distributed-Memory
Multicomputers"
Ernesto Su, Daniel J. Palermo, and Prithviraj Banerjee
in Proceedings of the 1993 International Conference on Parallel Processing,
pages II:30-38, St. Charles, IL, August 1993.

Distributed-memory multicomputers such as the Intel iPSC/860, the NCUBE/2,
the Intel Paragon and the Connection Machine CM-5 offer significant
advantages over shared-memory multiprocessors in terms of cost and
scalability. Unfortunately, to extract all the computational power from
these machines, users have to parallelize their existing serial programs,
which can be an extremely laborious process. One major reason for this
difficulty is the absence of a single global shared address space. As a
result, the programmer has to distribute code and data on processors and
manage communication among tasks explicitly. Clearly there is a need for
efficient parallelizing compiler support on these machines. The PARADIGM
project at the University of Illinois addresses these problems by
developing a fully automated means to translate serial programs for
efficient execution on distributed-memory multicomputers. In this paper we
discuss parallelization of regular computations using symbolic sets as a
representation for data accesses and communication.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% pact94.spb.ps
%
"Processor Tagged Descriptors: A Data Structure for Compiling for
Distributed-Memory Multicomputers"
Ernesto Su, Daniel J. Palermo, and Prithviraj Banerjee
in Proceedings of the 1994 International Conference on Parallel
Architectures and Compilation Techniques, pages 123-132, Montréal, Canada,
August 1994.

The computation partitioning, communication analysis, and optimization
phases performed during compilation for distributed-memory multicomputers
require an efficient way of describing distributed sets of iterations and
regions of data. Processor Tagged Descriptors (PTDs) provide these
capabilities through a single set representation parameterized by the
processor location for each dimension of a virtual mesh. A uniform
representation is maintained for every processor in the mesh, whether it is
a boundary or an interior node. As a result, operations on the sets are
very efficient because the effect on all processors in a dimension can be
captured in a single symbolic operation. In addition, PTDs are easily
extended to an arbitrary number of dimensions, necessary for describing
iteration sets in multiply nested loops as well as sections of
multidimensional arrays. Using the symbolic features of PTDs it is also
possible to generate code for variable numbers of processors, thereby
allowing a compiled program to run unchanged on varying sized machines. The
PARADIGM (PARAllelizing compiler for DIstributed-memory General-purpose
Multicomputers) project at the University of Illinois utilizes PTDs to
provide an automated means to parallelize serial programs for execution on
distributed-memory multicomputers.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% ics95.slrphb.ps
%
"Advanced Compilation Techniques in the PARADIGM Compiler for
Distributed-Memory Multicomputers"
Ernesto Su, Antonio Lain, Shankar Ramaswamy, Daniel J. Palermo,
Eugene W. Hodges IV, and Prithviraj Banerjee
in Proceedings of the 9th ACM International Conference on Supercomputing,
pages 424-433, Barcelona, Spain, July 1995.

The PARADIGM compiler project provides an automated means to parallelize
programs, written in a serial programming model, for efficient execution on
distributed-memory multicomputers. A previous implementation of the
compiler, based on the PTD representation, allowed symbolic array sizes,
affine loop bounds and array subscripts, and a variable number of
processors, provided that arrays were single- or multi-dimensionally block
distributed. The techniques presented here extend the compiler to also
accept multi-dimensional cyclic and block-cyclic distributions within a
uniform symbolic framework. These extensions demand more sophisticated
symbolic manipulation capabilities. A novel aspect of our approach is to
meet this demand by interfacing PARADIGM with a powerful off-the-shelf
symbolic package, Mathematica(tm). This paper describes some of the
Mathematica(tm) routines that perform various transformations, shows how
they are invoked and used by the compiler to overcome the new challenges,
and presents experimental results for code involving cyclic and
block-cyclic arrays as evidence of the feasibility of the approach.
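
A small Python illustration of the index arithmetic that cyclic and
block-cyclic distributions introduce (the closed forms below are standard
block-cyclic mappings written out concretely; the compiler derives and
manipulates such expressions symbolically):

    def to_local(i, block, nprocs):
        """Owner and local position of global element i under CYCLIC(block)."""
        owner = (i // block) % nprocs
        local = (i // (block * nprocs)) * block + i % block
        return owner, local

    # CYCLIC(2) over 3 processors:
    for i in range(8):
        print(i, to_local(i, 2, 3))
    # e.g. global element 7 lives on processor 0 at local position 3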