Winter Quarter 1998, Seminar Series

The speakers and topics of seminars that will be held during winter are available here.

The remaining schedule for the quarter will be posted on the Web shortly.

Memory Hierarchies for Future Microprocessors

Douglas C. Burger
Computer Sciences Department,
University of Wisconsin,
1210 West Dayton Street, Madison,
Wisconsin, USA.

Abstract

Modern computer systems contain sophisticated memory systems that are organized in hierarchies (registers, one to three levels of cache, physical memory, and the disk or network). The goal of these hierarchies is to provide the illusion of a single memory that is large, fast, and cheap. Unfortunately, exponential growth in processor performance is making this goal increasingly hard to attain. In my dissertation work, I explore a number of solutions, ranging from conventional to radical, that will improve memory system performance for future microprocessors. In this talk, I will survey some of these solutions, which include more efficient caches, merging large caches with main memory, and one novel, memory-centric architecture (DataScalar). From these analyses, I will draw conclusions about how future systems are likely to evolve.

Biographical sketch

Doug Burger is a Ph.D. candidate at the University of Wisconsin-Madison. His dissertation topic is memory hierarchies and system organizations for advanced microprocessors. He received an M.S. from the University of Wisconsin-Madison and a B.S. from Yale University, both in computer science. He expects to finish his Ph.D. in 1998 and hopes to be gainfully employed shortly thereafter.

RCSP: A Parameterized Reduced Configuration Space Processor and its Programming Environment

Dr. Krishna V. Palem
Department of Computer Science,
Courant Institute of Mathematical Sciences,
New York University,
New York, USA.

Abstract

Traditional ASIC/DSP-based approaches to high-performance embedded system development are usually characterized by high design time/cost overhead and/or poor application portability. While commodity microprocessors have not been the platform of choice for high-performance embedded systems in the past, recent developments in reconfigurable computing devices and the real possibility of manufacturing integrated circuits with extremely high transistor counts have presented exciting new opportunities.

Taking advantage of these opportunities along with our past ILP compilation experience, we propose the Reduced Configuration Space Processor (RCSP), a parameterized family of reconfigurable processor architectures for high-performance embedded system development. The RCSP execution model is a novel extension to the traditional ILP execution model in that the data paths and the execution unit resources of an RCSP machine are dynamically reconfigurable under the control of the executing program. We suggest one possible micro-architecture for the RCSP model, which also serves as a target for our planned compilation environment.

Based on RCSP, we have identified a software architecture that provides a complete embedded system development framework composed of:

A parameterized compiler for targeting timing-annotated source descriptions to RCSPs.

A profiling and simulation environment to evaluate post-compilation performance and correctness.

A design space explorer for traversing the space of RCSPs most suitable for the embedded application domain of interest.

Concurrently, we have made progress in characterizing instruction scheduling optimizations for program graphs with real-time constraints. These optimizations target ILP processors and will be extended to this domain. We are also developing a compilation environment that can take a high-level language program, specified in C or C++ for example, and automatically generate code to run on the RCSP. We anticipate that this environment will significantly cut software development times compared to its DSP counterparts.

We shall present the processor architecture and compilation framework along with some preliminary performance estimates for a few key applications.

This is joint work with Suren Talla.

Biographical sketch

Dr. Krishna V. Palem has been an Associate Professor of Computer Science at the Courant Institute of Mathematical Sciences, NYU, since September 1994. Prior to this, he was a research staff member at the IBM T. J. Watson Research Center, and an advanced technology consultant on compiler optimizations at the IBM Santa Teresa Laboratory, working on parallel and optimizing compiler technologies. At the Courant Institute of NYU, he leads the ReaCT-ILP project aimed at developing programming tools and compiler optimizations for rapid prototyping of embedded applications on hardware with instruction-level parallelism, as well as on adaptive targets. A significant portion of this work is aimed at programming support for embedded applications. In these areas, he has received corporate awards for excellence from Hewlett-Packard, IBM and Panasonic.

Scalability of a Distributed Memory Finite-Difference Time-Domain Algorithm for the Solution of Maxwell's Equations

Jethro H. Greene
Advisee of Dr. Allen Taflove,
Graduate student,
Northwestern University,
Illinois, USA.

Abstract

Solving Maxwell's equations is essential for understanding communication, radar, high-speed electronics, wave interactions with human tissue, and optical switching. However, modeling them with sufficient resolution on complex geometries requires extreme computational expense. The scalability of a finite-difference time-domain algorithm for solving Maxwell's equations was studied on both shared- and distributed-memory parallel computers and its dependency on communication bandwidth and latency was determined.
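For readers unfamiliar with the method, the following is a minimal one-dimensional finite-difference time-domain sketch in normalized units. It is not the speaker's code: the function name `fdtd_1d`, the grid size, the Courant number of 0.5, and the Gaussian source are all assumptions chosen for illustration. It shows the leapfrog update in which each field value depends only on its nearest neighbours, which is why a distributed-memory version has to exchange boundary values between processors at every time step.

```python
import math

def fdtd_1d(n_cells=200, n_steps=150, source_cell=100):
    """Leapfrog update of E and H on a 1D grid in normalized units.

    The Courant number is fixed at 0.5 (an assumed, stable choice).
    Returns the electric field after n_steps.
    """
    courant = 0.5
    ez = [0.0] * n_cells          # electric field at integer grid points
    hy = [0.0] * (n_cells - 1)    # magnetic field at half grid points

    for step in range(n_steps):
        # Update H from the spatial difference of E (nearest neighbours only).
        for i in range(n_cells - 1):
            hy[i] += courant * (ez[i + 1] - ez[i])
        # Update interior E from the spatial difference of H.
        # The two end cells are never updated and stay at zero.
        for i in range(1, n_cells - 1):
            ez[i] += courant * (hy[i] - hy[i - 1])
        # Additive Gaussian source (hypothetical pulse parameters).
        ez[source_cell] += math.exp(-((step - 30) / 10.0) ** 2)
    return ez
```

In a parallel version, each processor would own a contiguous slice of `ez` and `hy` and would need one neighbouring value from each adjacent slice per update, so communication latency and bandwidth bound the achievable speedup.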

Design and Evaluation of Network Interfaces for System Area Networks

Shubhendu S. Mukherjee
Computer Sciences Department,
University of Wisconsin,
1210 West Dayton Street, Madison,
Wisconsin, USA.

Abstract

Clusters of Workstations (COWs) provide a low-cost alternative to message-passing parallel computers. COWs connect multiple commodity workstations with a System Area Network (SAN). The high performance and high reliability of a SAN provide a COW with the potential to efficiently support the low-latency and fine-grain communication that arises in many parallel applications.

Unfortunately, a workstation's poor interaction with a network interface (NI) significantly impedes harnessing the benefits of a SAN. An NI is a device that allows a workstation to send messages to and receive messages from a network. Conventional workstation NIs suffer from several sources of high latency because they were designed with an interface similar to a disk's interface.

During this talk I will outline techniques to improve the performance of interactions between a workstation and an NI. A key principle underlies these techniques: an NI access should be treated like a memory access, and not like a disk interface access. I have captured this principle in a novel NI design called Coherent Network Interfaces (CNIs). CNIs interact with the processors and memory of a workstation using coherent, cacheable memory operations. My simulation results show that, when compared to a more conventional NI, a CNI can improve the performance of seven parallel scientific applications by between 6% and 190%.

Biographical sketch

Shubhendu S. Mukherjee (http://www.cs.wisc.edu/~shubu) is a Ph.D. candidate at the University of Wisconsin-Madison. His dissertation research focuses on network interfaces for message-passing parallel computers. He has also worked extensively on cache coherence protocols as part of his research with the Wisconsin Wind Tunnel project. Mukherjee received an MS from the University of Wisconsin-Madison in 1993 and a BTech from the Indian Institute of Technology, Kanpur in 1991.

Memory Dependence Prediction

Andreas Moshovos
Computer Sciences Department,
University of Wisconsin,
1210 West Dayton Street, Madison,
Wisconsin, USA.

Abstract

Programs are written with the abstraction of uniform, fast access to all memory locations via the use of addresses. Unfortunately, not all of memory can be built to match the processor's speed. To approximate a large and fast memory, modern computer systems employ memory hierarchies (e.g., several levels of caches). Traditionally, these memory hierarchies are designed to exploit characteristics of the program's address stream (e.g., caches exploit the locality in the address stream of ordinary programs). However, the exponential growth in processor performance that we continue to enjoy challenges these memory hierarchies, creating a need for further improvement.

In this presentation, I revisit memory hierarchy design and view memory as an inter-operation communication agent: memory is often used to communicate values among program operations. This perspective exposes limitations that are inherent in the use of an address-based memory hierarchy. To overcome these limitations, I introduce Memory Dependence Prediction, a technique that transparently captures the communication relationships among memory-accessing operations. Memory dependence prediction enables a number of novel techniques that improve upon traditional memory hierarchies. In this presentation I will focus on the following two techniques: (1) Dynamic Speculation and Synchronization, and (2) Speculative Memory Cloaking. The first technique exposes the parallelism that is hindered by the use of addresses. The second technique modifies memory operations as they are encountered for execution so that they can communicate through a low-latency communication mechanism.

I argue that these techniques represent a first step toward exploiting the behavior of the program's operations to improve memory hierarchies.
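The idea of capturing communication relationships among memory operations, rather than among addresses, can be illustrated with a toy software model. The sketch below is an assumption-laden illustration, not the mechanism presented in the talk: the function name `train_dependence_predictor`, the trace format, and the unbounded tables are all hypothetical. It records, for each load instruction, the store instruction that last produced its value, and counts how often that pairing repeats.

```python
def train_dependence_predictor(trace):
    """Learn (load PC -> store PC) pairs from a memory trace.

    `trace` is a list of (pc, kind, address) tuples, with kind "st" or "ld".
    This is a toy model with unbounded tables, not a hardware design.
    Returns (predictions, correct predictions, loads that had a producer).
    """
    last_store_to = {}   # address -> PC of the most recent store to it
    predicted = {}       # load PC -> store PC it last depended on
    correct = 0
    loads_with_dep = 0
    for pc, kind, addr in trace:
        if kind == "st":
            last_store_to[addr] = pc
        elif addr in last_store_to:
            actual = last_store_to[addr]
            loads_with_dep += 1
            # The prediction is keyed by instruction, not by address.
            if predicted.get(pc) == actual:
                correct += 1
            predicted[pc] = actual
    return predicted, correct, loads_with_dep
```

In the usage below, the same store and load instructions communicate through two different addresses, yet the second dependence is predicted correctly because the table is indexed by program counter.

```python
trace = [(0x10, "st", 100), (0x20, "ld", 100),
         (0x10, "st", 104), (0x20, "ld", 104)]
table, hits, total = train_dependence_predictor(trace)
```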

Biographical sketch

Andreas Moshovos received his undergraduate and M.Sc. degrees in Computer Science from the University of Crete, Greece in 1990 and 1992, respectively. He then joined the Computer Science Department of New York University as a Ph.D. candidate. Since July 1993 he has been a Ph.D. candidate in the Computer Sciences Department of the University of Wisconsin-Madison, where he works under the supervision of Prof. Gurindar S. Sohi and as a member of the Multiscalar Architecture group. He expects to graduate with a Ph.D. degree in Computer Science by September of 1998. His thesis research introduces memory dependence prediction and focuses on micro-architectural techniques to improve upon traditional memory hierarchies.

Metacomputing Environments for Optimization

Dr. Jorge Nocedal
Deputy Director, Optimization Technology Center,
Department of Electrical and Computer Engineering,
Northwestern University,
Illinois, USA.

Abstract

The Network-Enabled Optimization System (NEOS) is an environment for modeling and solving optimization problems on the Internet. It offers a collection of servers geared to a variety of optimization problems, and also allows for different user interfaces. Our interest is to make NEOS more versatile by allowing interactions with user programs, and to make it more powerful by solving very large application problems in a distributed and heterogeneous network.

In this talk we describe the areas of optimization we plan to study, and outline how Globus and Condor will be adapted to serve our purposes.

Biographical sketch

Improving Data Supply for Multi-Issue Processors

Jude A. Rivers
Advanced Computer Architecture Laboratory,
Department of Electrical Engineering and Computer Science,
The University of Michigan at Ann Arbor,
Michigan, USA.

Abstract

To take advantage of advances in VLSI technology for higher performance, current and future high-end microprocessors are being designed to issue and execute multiple instructions per cycle. For example, processors capable of issuing 16 instructions per cycle are being discussed. At the same time, memory speeds are not increasing as fast as processor speeds. With memory operations accounting for about a third of the average instruction stream, more and more demand is being placed on the data memory hierarchy. In particular, on-chip caches commonly serve as the head of the data memory hierarchy. An on-chip data cache, however, is only beneficial if it can supply the requested data within a short cycle time. With these emerging data supply requirements, there is a great need for data cache structures that can effectively minimize the average data access times and supply multiple data in a single cycle.

This talk will be presented in two parts. In the first part, I will introduce the Non-Temporal Streaming (NTS) Cache. The NTS Cache is an example of multi-lateral cache structures that partition the first level (L1) cache into multiple subcaches. For these designs, the data reference stream of a program is subdivided into appropriate classes, and each class is mapped into a specific subcache whose management policy is more suitable for the access patterns and/or usage characteristics of that class. This sort of selective organization and caching actively retains more useful data in the L1 cache structure, which translates into more cache hits, less cache-memory bus contention, and overall improvement in the average data access time. The NTS Cache utilizes data reuse information for intelligent data placement and active management of its 2-unit lateral structure. I will compare and contrast the NTS Cache with other proposed multi-lateral cache designs, and also show that the multi-lateral design approach generally provides an attractive alternative to the current trend of increasingly large but poorly managed caches.

The second part of the talk will explore the performance and limitations of current approaches to multiple data supply. Currently, multibanking and multiporting by replication are the two popular and implementable approaches to providing multiple ports to the data cache. However, these approaches do not appear scalable with increasing data access parallelism. Whereas the multibanking technique suffers substantial performance degradation because of bank conflicts, multiporting by replication is die-area limited and plagued by the need to broadcast stores for coherence. Analysis of the average SPEC95 memory reference stream, however, reveals that the great majority of conflicts in a multibank cache result from consecutive references that map into the same cache line of the same cache bank. I will introduce the Locality-Based Interleaved Cache (LBIC), which is built on traditional multibanking and employs limited multiporting to exploit same-line spatial locality. Our detailed simulation results suggest that the LBIC structure is capable of outperforming current multiporting approaches.

I will conclude this talk by exploring the data supply potential of cache memory structures that integrate both the NTS and LBIC schemes.
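The bank-conflict analysis described in the second part of the talk can be sketched in a few lines. The following is only an illustrative model under stated assumptions, not the simulator behind the reported results: the function name `classify_bank_conflicts`, the 32-byte line size, the four line-interleaved banks, and the two references per cycle are all hypothetical parameters. It separates conflicts between references to the same cache line from conflicts between different lines that happen to share a bank.

```python
def classify_bank_conflicts(addresses, line_bytes=32, n_banks=4, width=2):
    """Count same-line and different-line bank conflicts.

    Assumptions (not from the talk): banks are interleaved by cache line,
    and `width` consecutive references access the cache in the same cycle.
    Returns (same_line_conflicts, different_line_conflicts).
    """
    same_line = 0
    diff_line = 0
    for start in range(0, len(addresses), width):
        group = addresses[start:start + width]
        seen = {}  # bank -> line number of the first reference to that bank
        for addr in group:
            line = addr // line_bytes
            bank = line % n_banks
            if bank not in seen:
                seen[bank] = line
            elif seen[bank] == line:
                same_line += 1   # same line, same bank: spatial locality
            else:
                diff_line += 1   # different lines contending for one bank
    return same_line, diff_line
```

For example, `classify_bank_conflicts([0, 8, 0, 128, 0, 32])` pairs the references as (0, 8), (0, 128) and (0, 32): the first pair falls in the same line, the second maps two different lines to bank 0, and the third uses two different banks.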