M. Kandemir, J. Ramanujam, and A. Choudhary
Abstract
Exploiting locality of reference is key to realizing high levels of performance on modern processors. This paper describes a compiler algorithm to optimize cache locality in scientific codes on uniprocessor and multiprocessor machines. A distinctive characteristic of our algorithm is that it considers loop and data layout transformations in a unified framework. We illustrate through examples that our approach is very effective at reducing cache misses and tile-size sensitivity of blocked loop nests; and can optimize some nests for which optimization techniques based on loop transformations alone are not successful. An important special case is the one in which data layouts of some arrays are fixed and cannot be changed. We show how our algorithm can handle this case, and demonstrate how it can be used to optimize multiple loop nests. We use a simple heuristic for optimizing whole programs. Experiments on several benchmarks show that the techniques presented in this paper result in substantial improvement in cache performance.