DATE: Tuesday Jan 12
TIME/PLACE: 4-5PM, L324
SPEAKER: Jian Chen, ECE Department
Abstract:
Distributed systems, which consist of a collection of high
performance systems interconnected via high performance networks
(e.g. ATM), are becoming feasible platforms for execution
of large-scale, complex problems. In this talk, we address
various issues related to mesh partitioning for distributed systems.
These issues include the metric used to compare different partitions,
efficiency of the application executing on a distributed system, the
number of cut sets, and the advantage of exploiting heterogeneity
in network performance. We present a tool called PART, for automatic
mesh partitioning for distributed systems. The novel feature
of PART is that it considers heterogeneities in the application and the
distributed system. The heterogeneities in the distributed system
include processor and network performance; the heterogeneities in the
application include computational complexities.
Experimental results are presented for partitioning regular
and irregular finite element meshes for the WHAMS2D application
executing on a distributed system consisting of two IBM SPs.
The results from the regular problems indicate a 33-46% increase in
efficiency when processor performance is considered as compared
to the conventional even partitioning; the results also indicate an
additional 5-16% increase in efficiency when network performance
is considered. The result from the irregular problem indicate a 21%
increase in efficiency when processor and network performance are
considered as compared to even partitioning.
DATE: Tuesday Jan 19
TIME/PLACE: 4-5PM, L324
SPEAKER: Prof. Andreas Moshovos, ECE Department
Abstract:
While simple in nature, memory dependence prediction can be used to
significantly improve performance. In this talk I will present two
applications of memory dependence prediction. The first aims at executing
loads and stores with as much parallelism as possible. The second
dynamically (i.e., while the program executes and in a programmer/program
transparent way) converts parts of the program in a different form,
replacing memory with more effecient mechanisms.
Finally and time permitting, I will touch upon future research directions.
Note: Most of this presentation is geared toward a general audience.
In this talk I will introduce Memory Dependence Prediction and some of
its applications. Memory dependence prediction provides a highly accurate
guess on whether a load or a store will read or write a memory address read
or written by a preceding store.
DATE: Tuesday Jan 26
TIME/PLACE: 4-5PM, L324
SPEAKER: Dhruva Chakrabarti, ECE Department
Abstract:
In this talk, I will present an approach for parallelizing hybrid
applications to be executed on distributed memory machines. I will
explain the issues relevant to obtaining good performance for parallel
hybrid programs. Several data distribution schemes will be considered
and schemes to handle various kinds of accesses will be
presented. Comparison of our scheme with some others will be presented
at an abstract level. I will finally present some experimental results
to show the feasibility of our approach.
DATE: Tuesday Feb 02
TIME/PLACE: 4-5PM, L324
SPEAKER: Prof. Jennifer Schopf, Computer Science Department
Abstract:
This talk will discuss a promising approach for predicting and
scheduling distributed parallel applications using stochastic
data, or distributions. Whereas a point value provides a
single value representation of a quantity, a stochastic value
provides a set of possible values weighted by probabilities
to represent a range of likely behavior. We then use these
values in a stochastic scheduling policy to achieve
faster execution times with more predictable application
behavior.
Current distributed parallel platforms can provide the
resources required to execute a scientific application
efficiently. However, when these platforms are shared by
multiple users, the performance of the applications using
the system may be impacted in dynamic and often
unpredictable ways. In particular, it becomes challenging
to achieve good performance for any single application
using the shared resources.
DATE: Tuesday Feb 09
TIME/PLACE: 4-5PM, B211 (Mechanical Engineering Conference Room)
SPEAKER: Prof. Babak Falsafi, ECE Department, Purdue University
Abstract:
My research focus is on designing intelligent DSMs that smell, act,
and feel like the uniform memory architecture of SMPs. In these DSMs
the memory system transparently--without involving the application
software--hides remote communication. In this talk, I will present
results from two papers (to appear in ISCA '99) describing transparent
hardware techniques to improve DSM performance. In the first half of
the talk, I will introduce a new memory consistency model, SC++, that
preserves the simple and intuitive programming interface of sequential
consistency (SC) while providing the performance of the best of
relaxed consistency models, Release Consistency (RC). SC++ uses ILP
instruction speculation mechanisms to execute memory operations out of
order. I will present simulation results indicating that SC++ performs
as well as RC.
In the second half of the talk, I will present novel hardware DSMs
which learn, predict, and speculatively execute shared-memory accesses
in advance to hide some of or all the remote access latency. The key
mechanism behind such these DSMs are Memory Sharing Predictors,
pattern-based predictors based on the ubiquitous Yeh&Patt two-level
branch predictors that learn and accurately predict memory sharing
patterns. I will present prediction accuracies and implementation cost
for two MSP designs, and show preliminary performance results on
speculatively executing shared-memory operations.
Distributed shared memory (DSM) has emerged as a promising approach to
building large-scale shared-memory servers. DSM provides a
shared-memory programming abstraction despite having memory physically
distributed. Application performance tuning on DSM, however, is often
difficult due to the inherent non-uniform nature of memory. In DSM,
remote memory accesses often take up to 10 to 100 times longer than
local memory accesses. Conventional proposals to improve DSM
performance provide software interfaces to optimize shared memory
accesses and leave the burden of using them to the application
programmer and/or compiler writer.
DATE: Tuesday Feb 16
TIME/PLACE: 4-5PM, L324
SPEAKER: Pramod Joisha, ECE Department
Abstract:
In this talk, we focus on how ownership sets may be efficiently
determined in the context of the HPF language, and show how the
structure of these sets can be symbolically characterized in the
presence of the most general data alignment and data distribution
directives. Starting from the originally proposed system of equalities
and inequalities, we arrive at a new system which allows us to scan the
members of the ownership set using the course vector as the only
auxiliary vector. Furthermore, the formulation makes it possible to
enumerate the elements of the ownership set exactly once, a feature
that is very beneficial when such sets are applied to handle DO loops
qualified by HPF's INDEPENDENT directive.
We present polynomial-time schemes that determine whether the ownership
set of a particular processor with respect to some data array is the
empty set, or whether the ownership set of every processor with respect
to some data array is the empty set. We also show how data distribution
directives with unspecified processor meshes can be efficiently handled
at compile-time. We derive and develop important and general properties
relating to HPF alignments and distributions, and which permit the
elimination of redundant communication in the presence of array
replication, and which also make it possible to avoid the generation of
communication code when pairs of array references are ultimately mapped
onto the same processors. Experimental data demonstrating the improved
code performance that the latter optimization enables is also presented
and discussed.
The underpinnings of our methods can be traced to earlier work by
Ancourt et al. Our schemes provide a robust framework for handling the
general regular compilation problem, even in the presence of loop and
array subscript expressions that are complex affine functions of
variables unknown at compile-time. In attaining its symbolic
manipulation capabilities, our compilation infrastructure relied upon a
commercially available industrial strength symbolic analysis software
called Mathematica.
Ownership sets are fundamental to the partitioning of program
computation across processors by the owner-computes rule in a
message-passing distributed-memory multicomputer. In programs written
using the HPF language, the compilation process is aided by data
partitioning information, provided through user-specified data
alignment and data distribution directives.
DATE: Tuesday Feb 23
TIME/PLACE: 4-5PM, L324
SPEAKER: Mahmut Kandemir, ECE Department
Abstract:
This optimization framework was used with sixteen floating-point programs
from several benchmarks and math libraries, and the performance was
measured in a single node of the SGI Origin $2000$
distributed-shared-memory machine. The results demonstrate that the
approach is very effective in enhancing cache locality and outperforms
current solutions that use either loop based or data layout based
transformations alone.
In this talk I will present a new integrated compiler framework for
improving the cache performance of scientific applications. In addition to
applying loop transformations, the framework includes data layout
optimizations, i.e., those that change the memory layouts of arrays data
structures. A key characteristic of this approach is that loop
transformations are used to improve temporal locality while data layout
optimizations are used to improve spatial locality.
DATE: Tuesday Mar 02
TIME/PLACE: 4-5PM, L324
SPEAKER: Jonathan Geisler, ECE Department
Abstract:
Traditional performance optimazation techniques have focused on finding
the kernel in a program that is the most time consuming and attempting
to optimize it. We introduce a methodology for measuring and
representing the interaction, or coupling, between kernels that
improves upon the accuracy of the traditional method. We examine the
coupling that exists in five different programs at both the first and
second level caches. Next, we study the effects of scaling on the
coupling effects. Finally, we demonstrate the benefits of using the
new methodology by 1) developing a new hybrid algorithm for the
conjugate gradient application that results in fewer first level cache
misses than the other algorithms studied, and 2) transforming the
array addresses of two applications to reduce the number of second
level cache misses.