Title:

Issues in Mesh Partitioning for Distributed Systems

DATE: Tuesday Jan 12
TIME/PLACE: 4-5PM, L324
SPEAKER: Jian Chen, ECE Department

Abstract:
Distributed systems, which consist of a collection of high performance systems interconnected via high performance networks (e.g. ATM), are becoming feasible platforms for execution of large-scale, complex problems. In this talk, we address various issues related to mesh partitioning for distributed systems. These issues include the metric used to compare different partitions, efficiency of the application executing on a distributed system, the number of cut sets, and the advantage of exploiting heterogeneity in network performance. We present a tool called PART, for automatic mesh partitioning for distributed systems. The novel feature of PART is that it considers heterogeneities in the application and the distributed system. The heterogeneities in the distributed system include processor and network performance; the heterogeneities in the application include computational complexities. Experimental results are presented for partitioning regular and irregular finite element meshes for the WHAMS2D application executing on a distributed system consisting of two IBM SPs. The results from the regular problems indicate a 33-46% increase in efficiency when processor performance is considered as compared to the conventional even partitioning; the results also indicate an additional 5-16% increase in efficiency when network performance is considered. The result from the irregular problem indicate a 21% increase in efficiency when processor and network performance are considered as compared to even partitioning.



Title:

Memory Dependence Prediction and Future Directions

DATE: Tuesday Jan 19
TIME/PLACE: 4-5PM, L324
SPEAKER: Prof. Andreas Moshovos, ECE Department

Abstract:
In this talk I will introduce Memory Dependence Prediction and some of its applications. Memory dependence prediction provides a highly accurate guess on whether a load or a store will read or write a memory address read or written by a preceding store.

While simple in nature, memory dependence prediction can be used to significantly improve performance. In this talk I will present two applications of memory dependence prediction. The first aims at executing loads and stores with as much parallelism as possible. The second dynamically (i.e., while the program executes and in a programmer/program transparent way) converts parts of the program in a different form, replacing memory with more effecient mechanisms.

Finally and time permitting, I will touch upon future research directions.

Note: Most of this presentation is geared toward a general audience.



Title:

A Compilation Framework for Hybrid Applications

DATE: Tuesday Jan 26
TIME/PLACE: 4-5PM, L324
SPEAKER: Dhruva Chakrabarti, ECE Department

Abstract:
In this talk, I will present an approach for parallelizing hybrid applications to be executed on distributed memory machines. I will explain the issues relevant to obtaining good performance for parallel hybrid programs. Several data distribution schemes will be considered and schemes to handle various kinds of accesses will be presented. Comparison of our scheme with some others will be presented at an abstract level. I will finally present some experimental results to show the feasibility of our approach.



Title:

Prediction and Scheduling for Shared Clusters of Workstations

DATE: Tuesday Feb 02
TIME/PLACE: 4-5PM, L324
SPEAKER: Prof. Jennifer Schopf, Computer Science Department

Abstract:
Current distributed parallel platforms can provide the resources required to execute a scientific application efficiently. However, when these platforms are shared by multiple users, the performance of the applications using the system may be impacted in dynamic and often unpredictable ways. In particular, it becomes challenging to achieve good performance for any single application using the shared resources.

This talk will discuss a promising approach for predicting and scheduling distributed parallel applications using stochastic data, or distributions. Whereas a point value provides a single value representation of a quantity, a stochastic value provides a set of possible values weighted by probabilities to represent a range of likely behavior. We then use these values in a stochastic scheduling policy to achieve faster execution times with more predictable application behavior.



(Invited Speaker)

Title:

Intelligent Distributed Shared Memory

DATE: Tuesday Feb 09
TIME/PLACE: 4-5PM, B211 (Mechanical Engineering Conference Room)
SPEAKER: Prof. Babak Falsafi, ECE Department, Purdue University

Abstract:
Distributed shared memory (DSM) has emerged as a promising approach to building large-scale shared-memory servers. DSM provides a shared-memory programming abstraction despite having memory physically distributed. Application performance tuning on DSM, however, is often difficult due to the inherent non-uniform nature of memory. In DSM, remote memory accesses often take up to 10 to 100 times longer than local memory accesses. Conventional proposals to improve DSM performance provide software interfaces to optimize shared memory accesses and leave the burden of using them to the application programmer and/or compiler writer.

My research focus is on designing intelligent DSMs that smell, act, and feel like the uniform memory architecture of SMPs. In these DSMs the memory system transparently--without involving the application software--hides remote communication. In this talk, I will present results from two papers (to appear in ISCA '99) describing transparent hardware techniques to improve DSM performance. In the first half of the talk, I will introduce a new memory consistency model, SC++, that preserves the simple and intuitive programming interface of sequential consistency (SC) while providing the performance of the best of relaxed consistency models, Release Consistency (RC). SC++ uses ILP instruction speculation mechanisms to execute memory operations out of order. I will present simulation results indicating that SC++ performs as well as RC.

In the second half of the talk, I will present novel hardware DSMs which learn, predict, and speculatively execute shared-memory accesses in advance to hide some of or all the remote access latency. The key mechanism behind such these DSMs are Memory Sharing Predictors, pattern-based predictors based on the ubiquitous Yeh&Patt two-level branch predictors that learn and accurately predict memory sharing patterns. I will present prediction accuracies and implementation cost for two MSP designs, and show preliminary performance results on speculatively executing shared-memory operations.



Title:

Efficiently Exploiting Ownership Sets in HPF

DATE: Tuesday Feb 16
TIME/PLACE: 4-5PM, L324
SPEAKER: Pramod Joisha, ECE Department

Abstract:
Ownership sets are fundamental to the partitioning of program computation across processors by the owner-computes rule in a message-passing distributed-memory multicomputer. In programs written using the HPF language, the compilation process is aided by data partitioning information, provided through user-specified data alignment and data distribution directives.

In this talk, we focus on how ownership sets may be efficiently determined in the context of the HPF language, and show how the structure of these sets can be symbolically characterized in the presence of the most general data alignment and data distribution directives. Starting from the originally proposed system of equalities and inequalities, we arrive at a new system which allows us to scan the members of the ownership set using the course vector as the only auxiliary vector. Furthermore, the formulation makes it possible to enumerate the elements of the ownership set exactly once, a feature that is very beneficial when such sets are applied to handle DO loops qualified by HPF's INDEPENDENT directive.

We present polynomial-time schemes that determine whether the ownership set of a particular processor with respect to some data array is the empty set, or whether the ownership set of every processor with respect to some data array is the empty set. We also show how data distribution directives with unspecified processor meshes can be efficiently handled at compile-time. We derive and develop important and general properties relating to HPF alignments and distributions, and which permit the elimination of redundant communication in the presence of array replication, and which also make it possible to avoid the generation of communication code when pairs of array references are ultimately mapped onto the same processors. Experimental data demonstrating the improved code performance that the latter optimization enables is also presented and discussed.

The underpinnings of our methods can be traced to earlier work by Ancourt et al. Our schemes provide a robust framework for handling the general regular compilation problem, even in the presence of loop and array subscript expressions that are complex affine functions of variables unknown at compile-time. In attaining its symbolic manipulation capabilities, our compilation infrastructure relied upon a commercially available industrial strength symbolic analysis software called Mathematica.



Title:

Improving Locality Using Loop and Data Transformations in an Integrated Framework

DATE: Tuesday Feb 23
TIME/PLACE: 4-5PM, L324
SPEAKER: Mahmut Kandemir, ECE Department

Abstract:
In this talk I will present a new integrated compiler framework for improving the cache performance of scientific applications. In addition to applying loop transformations, the framework includes data layout optimizations, i.e., those that change the memory layouts of arrays data structures. A key characteristic of this approach is that loop transformations are used to improve temporal locality while data layout optimizations are used to improve spatial locality.

This optimization framework was used with sixteen floating-point programs from several benchmarks and math libraries, and the performance was measured in a single node of the SGI Origin $2000$ distributed-shared-memory machine. The results demonstrate that the approach is very effective in enhancing cache locality and outperforms current solutions that use either loop based or data layout based transformations alone.



Title:

Performance Coupling: A Methodology for Analyzing Application Performance using Kernel Performance

DATE: Tuesday Mar 02
TIME/PLACE: 4-5PM, L324
SPEAKER: Jonathan Geisler, ECE Department

Abstract:
Traditional performance optimazation techniques have focused on finding the kernel in a program that is the most time consuming and attempting to optimize it. We introduce a methodology for measuring and representing the interaction, or coupling, between kernels that improves upon the accuracy of the traditional method. We examine the coupling that exists in five different programs at both the first and second level caches. Next, we study the effects of scaling on the coupling effects. Finally, we demonstrate the benefits of using the new methodology by 1) developing a new hybrid algorithm for the conjugate gradient application that results in fewer first level cache misses than the other algorithms studied, and 2) transforming the array addresses of two applications to reduce the number of second level cache misses.