Winter Quarter 1998, Seminar Series
The speakers and topics of seminars that will be held during winter
are available here.
The remaining schedule for the quarter will be posted on the Web
shortly.
Memory Hierarchies for Future Microprocessors
Douglas C. Burger
Computer Sciences Department,
University of Wisconsin,
1210 West Dayton Street, Madison,
Wisconsin, USA.
Abstract
Modern computer systems contain sophisticated memory systems that are
organized in hierarchies (registers, one to three levels of cache,
physical memory, and the disk or network). The goal of these
hierarchies is to provide the illusion of a single memory that is
large, fast, and cheap. Unfortunately, exponential growth in
processor performance is making this goal increasingly hard to attain.
In my dissertation work, I explore a number of solutions, ranging from
conventional to radical, that will improve memory system performance
for future microprocessors. In this talk, I will survey some of
these solutions, which include more efficient caches, merging large
caches with main memory, and one novel, memory-centric architecture
(DataScalar). From these analyses, I will draw conclusions about how
future systems are likely to evolve.
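As a rough illustration of how a hierarchy approximates a single fast memory, the average access time can be sketched as follows. This is a generic textbook-style calculation, not material from the talk; the function name `average_access_time` and all latencies and hit rates are hypothetical values chosen for illustration.

```python
def average_access_time(levels, backing_latency):
    """Average access time, in cycles, for a simple memory hierarchy.

    levels: list of (latency_cycles, hit_rate) pairs, fastest level first.
    backing_latency: latency of the final level, assumed to always hit.
    All numbers passed in are illustrative assumptions.
    """
    time = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for latency, hit_rate in levels:
        time += reach * latency       # every access reaching a level pays its latency
        reach *= (1.0 - hit_rate)     # only the misses continue downward
    return time + reach * backing_latency


# Hypothetical two-level cache in front of a 100-cycle main memory.
amat = average_access_time([(1, 0.95), (10, 0.80)], backing_latency=100)
print(round(amat, 3))
```

With these assumed numbers the average is about 2.5 cycles, even though main memory alone costs 100. As the processor gets faster relative to memory, the miss penalties grow in cycle terms and the same hit rates yield a worse average.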
Biographical sketch
Doug Burger is a Ph.D. candidate at the University of
Wisconsin-Madison. His dissertation topic is memory hierarchies and
system organizations for advanced microprocessors. He received an
M.S. from the University of Wisconsin-Madison and a B.S. from Yale
University, both in computer science. He expects to finish his
Ph.D. in 1998 and hopes to be gainfully employed shortly thereafter.
RCSP: A Parameterized Reduced Configuration Space Processor and
its Programming Environment
Dr. Krishna V. Palem
Department of Computer Science,
Courant Institute of Mathematical Sciences,
New York University,
New York, USA.
Abstract
Traditional ASIC/DSP based approaches to high performance embedded
system development are usually characterized by high design time/cost
overhead and/or poor application portability. While commodity
microprocessors have not been the platform of choice for
high-performance embedded systems in the past, recent developments
in reconfigurable computing devices and the real possibility of
manufacturing integrated circuits with extremely high transistor counts
have presented exciting new opportunities.
Taking advantage of these opportunities along with our past ILP
compilation experience, we propose the Reduced Configuration Space
Processor (RCSP), a parameterized family of reconfigurable processor
architectures for high performance embedded system development. The
RCSP execution model is a novel extension to the traditional ILP
execution model in that the data paths and the execution unit
resources of an RCSP machine are dynamically reconfigurable under the
control of the executing program. We suggest one possible
micro-architecture for the RCSP model, which also serves as a target
for our planned compilation environment.
Based on RCSP, we have identified the software architecture to provide
a complete embedded system development framework composed of
Concurrently, we have made progress in characterizing instruction
scheduling optimizations for program graphs with real-time constraints,
targeted towards ILP processors; this work will be extended further to
this domain. We are also developing a compilation environment
that can take a program written in a high-level language, for example
C or C++, and automatically generate code to run on the
RCSP. We anticipate that this environment will significantly cut
software development time compared to its DSP
counterparts.
We shall present the processor architecture and compilation framework
along with some preliminary performance estimates for a few key
applications.
This is joint work with Suren Talla.
Biographical sketch
Dr. Krishna V. Palem has been an Associate Professor of Computer
Science in the Courant Institute of Mathematical Sciences, NYU, since
September 1994. Prior to this, he was a research staff member at the
IBM T. J. Watson Research Center, and an advanced technology
consultant on compiler optimizations at the IBM Santa Teresa
Laboratory working on parallel and optimizing compiler technologies.
At the Courant Institute of NYU, he leads the ReaCT-ILP project, aimed
at developing programming tools and compiler optimizations for rapid
prototyping of embedded applications on hardware with instruction-level
parallelism, as well as on adaptive targets. A significant portion of
this work is aimed at programming support for embedded applications. In
these areas, he has received corporate awards for excellence from
Hewlett-Packard, IBM and Panasonic.
Scalability of a Distributed Memory Finite-Difference Time-Domain
Algorithm for the Solution of Maxwell's Equations
Jethro H. Greene
Advisee of Dr. Allen Taflove,
Graduate student,
Northwestern University,
Illinois, USA.
Abstract
Solving Maxwell's equations is essential for understanding
communication, radar, high-speed electronics, wave interactions with
human tissue, and optical switching. However, solving them with
sufficient resolution on complex geometries is computationally very
expensive. The scalability of a finite-difference
time-domain algorithm for solving Maxwell's equations was studied on
both shared- and distributed-memory parallel computers and its
dependency on communication bandwidth and latency was determined.
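For readers unfamiliar with the method, a minimal serial sketch of a one-dimensional finite-difference time-domain update is given below. It is not the parallel code studied in the talk. The function name `fdtd_1d`, the grid size, the Courant number of 0.5, the normalized units, and the Gaussian source are all assumptions made for illustration.

```python
import math


def fdtd_1d(num_cells=200, num_steps=150, source_cell=100):
    """Leapfrog (Yee-style) updates of Ez and Hy on a 1D grid, normalized units.

    Illustrative assumptions: Courant number 0.5, free space, a soft Gaussian
    source, and no absorbing boundary (the end cells are simply not updated).
    """
    courant = 0.5
    ez = [0.0] * num_cells
    hy = [0.0] * num_cells
    for step in range(num_steps):
        # Electric field: each cell needs only its two neighbouring H values.
        for k in range(1, num_cells):
            ez[k] += courant * (hy[k - 1] - hy[k])
        # Soft source: add a Gaussian pulse in time at one cell.
        ez[source_cell] += math.exp(-0.5 * ((step - 40) / 12.0) ** 2)
        # Magnetic field: each cell needs only its two neighbouring E values.
        for k in range(num_cells - 1):
            hy[k] += courant * (ez[k] - ez[k + 1])
    return ez, hy


ez, hy = fdtd_1d()
print(max(abs(v) for v in ez))
```

Because every update reads only nearest-neighbour values, a grid split across processors has to exchange just the field values on subdomain boundaries at each time step. That per-step exchange is where communication bandwidth and latency enter the scalability question.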
Design and Evaluation of Network Interfaces for System Area
Networks
Shubhendu S. Mukherjee
Computer Sciences Department,
University of Wisconsin,
1210 West Dayton Street, Madison,
Wisconsin, USA.
Abstract
Clusters of Workstations (COWs) provide a low-cost alternative to
message-passing parallel computers. COWs connect multiple commodity
workstations with a System Area Network (SAN). The high performance
and high reliability of a SAN provide a COW with the potential to
efficiently support the low-latency and fine-grain communication that
arises in many parallel applications.
Unfortunately, a workstation's poor interaction with a network interface
(NI) significantly impedes harnessing the benefits of a SAN. An
NI is a device that allows a workstation to send messages to and
receive messages from a network. Conventional workstation NIs suffer from several
sources of high latency because they were designed with an interface
similar to a disk's interface.
During this talk I will outline techniques to improve the performance
of interactions between a workstation and an NI. A key principle
underlies these techniques: an NI access should be treated like a memory
access, and not like a disk interface access. I have captured this
principle in a novel NI design called Coherent Network Interfaces
(CNIs). CNIs interact with processors and memory of a workstation
using coherent, cachable memory operations. My simulation results show
that, compared to a more conventional NI, a CNI can improve the
performance of seven parallel scientific applications by 6% to 190%.
Biographical sketch
Shubhendu S. Mukherjee
(http://www.cs.wisc.edu/~shubu) is a Ph.D. candidate at the
University of Wisconsin-Madison. His dissertation research focuses on
network interfaces for message-passing parallel computers. He has also
worked extensively on cache coherence protocols as part of his
research with the Wisconsin Wind Tunnel project. Mukherjee received an
MS from the University of Wisconsin-Madison in 1993 and a BTech from the
Indian Institute of Technology, Kanpur in 1991.
Memory Dependence Prediction
Andreas Moshovos
Computer Sciences Department,
University of Wisconsin,
1210 West Dayton Street, Madison,
Wisconsin, USA.
Abstract
Programs are written with the abstraction of uniform, fast access to
all memory locations via the use of addresses. Unfortunately, not all
of memory can be built to match the processor's speed. To approximate a
large and fast memory, modern computer systems employ memory
hierarchies (e.g., several levels of caches). Traditionally, these
memory hierarchies are designed to exploit characteristics of the
program's address stream (e.g., caches exploit the locality in the
address stream of ordinary programs). However, the exponential growth
in processor performance that we continue to enjoy challenges these
memory hierarchies, creating a need for further improvement.
In this presentation, I revisit memory hierarchy design and view memory
as an inter-operation communication agent: memory is often used to
communicate values among program operations. This perspective exposes
limitations that are inherent in the use of an address-based memory
hierarchy. To overcome these limitations, I introduce Memory Dependence
Prediction, a technique that transparently captures the communication
relationships among memory accessing operations. Memory dependence
prediction enables a number of novel techniques that improve upon
traditional memory hierarchies. In this presentation I will focus on
the following two techniques: (1) Dynamic Speculation and
Synchronization, and (2) Speculative Memory Cloaking. The first
technique exposes the parallelism that is hindered by the use of
addresses. The second technique modifies memory operations as they are
encountered for execution so that they can communicate through a
low-latency communication mechanism.
I argue that these techniques represent a first step toward exploiting
the behavior of the program's operation to improve memory hierarchies.
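The idea of capturing communication relationships among memory operations can be illustrated with a toy model. The sketch below is not the mechanism presented in the talk. It assumes, purely for illustration, a table indexed by the load's instruction address (PC) that remembers which store last produced that load's value; `run_predictor` and the trace format are hypothetical.

```python
def run_predictor(trace):
    """Toy memory dependence predictor (illustrative assumption only).

    trace: list of (pc, kind, address) tuples, where kind is 'store' or 'load'.
    Returns how many loads had their producing store predicted correctly.
    """
    last_store_to = {}    # data address -> PC of the most recent store to it
    predicted_store = {}  # load PC -> store PC it depended on last time
    correct = 0
    for pc, kind, address in trace:
        if kind == "store":
            last_store_to[address] = pc
            continue
        actual = last_store_to.get(address)  # the store this load really depends on
        if actual is not None:
            if predicted_store.get(pc) == actual:
                correct += 1
            predicted_store[pc] = actual     # learn or refresh the store-load pair
    return correct


# A loop body executed three times: the data address changes on every
# iteration, but the same store (PC 0x10) always feeds the same load (PC 0x20).
trace = [
    (0x10, "store", 100), (0x20, "load", 100),
    (0x10, "store", 104), (0x20, "load", 104),
    (0x10, "store", 108), (0x20, "load", 108),
]
print(run_predictor(trace))  # prints 2
```

The first load has no history, so only the second and third are predicted. The point of the toy is that the store-load relationship stays stable across iterations even though the data addresses do not.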
Biographical sketch
Andreas Moshovos received his undergraduate and M.Sc. degrees in
Computer Science from the University of Crete, Greece in 1990 and 1992
respectively. He then joined the Computer Science Department of New
York University as a Ph.D. candidate. Since July 1993 he has been a Ph.D.
candidate at the Computer Sciences Department of the University of
Wisconsin-Madison where he works under the supervision of Prof.
Gurindar S. Sohi and as a member of the Multiscalar Architecture group.
He expects to graduate with a Ph.D. degree in Computer Science by
September of 1998. His thesis research introduces memory dependence
prediction and focuses on micro-architectural techniques to improve
upon traditional memory hierarchies.
Metacomputing Environments for Optimization
Dr. Jorge Nocedal
Deputy Director, Optimization Technology Center,
Department of Electrical and Computer Engineering,
Northwestern University,
Illinois, USA.
Abstract
The Network-Enabled Optimization System (NEOS) is an
environment for modeling and solving optimization problems
on the Internet. It offers a collection of servers geared to
a variety of optimization problems, and also allows for different
user interfaces. Our interest is to make NEOS more versatile
by allowing interactions with user programs, and to make it more
powerful by solving very large application problems in a distributed
and heterogeneous network.
In this talk we describe the areas of optimization we plan
to study, and outline how Globus and Condor will be adapted
to serve our purposes.
Biographical sketch
Improving Data Supply for Multi-Issue Processors
Jude A. Rivers
Advanced Computer Architecture Laboratory,
Department of Electrical Engineering and Computer Science,
The University of Michigan at Ann Arbor,
Michigan, USA.
Abstract
To take advantage of advances in VLSI technology for higher
performance, current and future high-end microprocessors are being
designed to issue and execute multiple instructions per cycle. For
example, processors capable of issuing 16 instructions per cycle are
being discussed. At the same time, memory speeds are not increasing as
fast as processor speeds. With memory operations accounting for about a
third of the average instruction stream, more and more demand is being
placed on the data memory hierarchy. In particular, on-chip caches
commonly serve as the head of the data memory hierarchy. An on-chip
data cache, however, is only beneficial if it can supply the requested
data within a short cycle time. With these emerging data supply
requirements, there is a great need for data cache structures that can
effectively minimize the average data access times and supply multiple
data in a single cycle.
This talk will be presented in two parts. In the first part, I will
introduce the Non-Temporal Streaming (NTS) Cache. The NTS Cache is an
example of multi-lateral cache structures that partition the first
level (L1) cache into multiple subcaches. For these designs, the data
reference stream of a program is subdivided into appropriate classes,
and each class is mapped into a specific subcache whose management
policy is more suitable for the access patterns and/or usage
characteristics of that class. This sort of selective organization
and caching actively retains more useful data in the L1 cache
structure, which translates into more cache hits, less cache-memory
bus contention, and overall improvement in the average data access
time. The NTS Cache utilizes data reuse information for intelligent
data placement and active management of its 2-unit lateral structure.
I will compare and contrast the NTS Cache with other proposed
multi-lateral cache designs, and also show that the multi-lateral
design approach generally provides an attractive alternative to the
current trend of increasingly large but poorly managed caches.
The second part of the talk will explore the performance and
limitations of current approaches to multiple data supply. Currently,
multibanking and multiporting by replication are the two popular and
implementable approaches to providing multiple ports to the data
cache. However, these approaches do not appear scalable with
increasing data access parallelism. Whereas the multibanking technique
suffers substantial performance degradation because of bank conflicts,
multiporting by replication is limited by die area and plagued by the
need to broadcast stores for coherence. Analysis of the average SPEC95
memory reference stream, however, reveals that the great majority of
all conflicts in a multibank cache result from consecutive
references that map into the same cache line of the same cache bank. I
will introduce the Locality-Based Interleaved Cache (LBIC), which is
built on traditional multibanking and employs limited multiporting to
exploit same-line spatial locality. Our detailed simulation results
suggest that the LBIC structure is capable of outperforming current
multiporting approaches.
I will conclude this talk by exploring the data supply potential of
cache memory structures that integrate both the NTS and LBIC schemes.
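The bank-conflict observation in the second part of the talk can be made concrete with a small sketch. It assumes cache lines interleaved across banks (line number modulo the bank count) and treats each pair of consecutive references as candidates for simultaneous service. The function `classify_conflicts`, the 32-byte line, the four banks, and the address trace are hypothetical and do not reflect the LBIC configuration or the SPEC95 data.

```python
def classify_conflicts(addresses, line_bytes=32, num_banks=4):
    """Classify bank conflicts between consecutive references.

    Assumes line-interleaved banking: bank = (address // line_bytes) % num_banks.
    Returns (same_line_conflicts, different_line_conflicts).
    """
    same_line = 0
    different_line = 0
    for prev, curr in zip(addresses, addresses[1:]):
        prev_line = prev // line_bytes
        curr_line = curr // line_bytes
        if prev_line % num_banks != curr_line % num_banks:
            continue  # different banks: no conflict for this pair
        if prev_line == curr_line:
            same_line += 1       # same bank and same cache line
        else:
            different_line += 1  # same bank, different cache lines
    return same_line, different_line


# Hypothetical byte-address trace.
trace = [0, 8, 16, 128, 32, 40]
print(classify_conflicts(trace))  # prints (3, 1)
```

In this made-up trace three of the four conflicts fall within a single cache line. That is the kind of pattern a design with limited multiporting per line could serve without stalling.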
Biographical sketch