Memory Access Phase Prediction for Chip Multiprocessors


A Quick Overview

In modern Chip Multiprocessors (CMPs), communication delay is a major source of system performance bottlenecks. Applications switch between different phases throughout their execution. When an application is in a high memory phase, the corresponding processor core accesses the shared memory resources heavily. To utilize shared memory resources efficiently, a well-designed system should try to reduce the periods when several applications are in a high memory phase simultaneously. A straightforward approach is to give high priority to applications that will exit their high memory phase soon. In this report, we propose a neural network based technique to predict the length of high memory phases. In particular, our algorithm predicts the number of instructions that will be executed before the application enters a low memory phase.


More details…

Motivation
Predicting when high and low memory phases occur during the execution of multiple applications allows better scheduling to avoid conflicts in the usage of shared memory resources.

Methodology
We use an Artificial Neural Network (ANN) to implement a predictor.  The inputs to the predictor are the number of instructions executed and the numbers of L1 instruction and data cache misses.  The output is the number of instructions that will be executed before the next low memory phase.

Results
Our experimental results show that we can obtain more than 90% accuracy using an ANN if we have a large enough data set.  The accuracy of the ANN in predicting memory phases depends on the size of the training data set.  Therefore, the ANN algorithm is well suited for static training and dynamic prediction.

Conclusions
Accurate but costly prediction: the ANN achieves high accuracy but requires a large data set and long training time.


Motivation

In modern Chip Multiprocessors (CMPs), communication delay, especially the delay between the processor and the main memory, is the bottleneck of system performance. Complex memory hierarchies have been designed to feed the processor cores with data. However, shared memory resources, such as lower-level caches and main memory bandwidth, are still the main source of performance penalty in current CMPs. Since larger caches have longer access delays, simply increasing the size of the cache will not always improve system performance. A well-designed memory system should optimize the access efficiency of shared memory resources.

In CMPs, several processor cores access the shared memory resources simultaneously. In most cases, processes switch between phases with different memory access patterns. In some phases, the hit rate of the level 1 (L1) cache is very high, which means the process does not access the shared L2 cache frequently. In other phases, the L1 hit rate is low and the process accesses the L2 cache frequently. We call these two kinds of phases Low Memory Phases (LMPs) and High Memory Phases (HMPs). System performance is maximized when the HMPs of different processor cores do not conflict with each other: a process in an HMP can then enjoy a large share of the cache and memory bandwidth. On the other hand, system performance degrades when the HMPs of different processor cores conflict with each other.

Our task in this project is to use a machine learning algorithm to predict which process will exit its HMP first. In particular, we use a neural network based algorithm to predict whether an application will exit its high memory phase within a given time. We first collect execution traces of different programs.  The execution of an application is divided into small time intervals, and in each interval multiple metrics representing the memory usage of the application are collected.  These per-interval metrics are used as attributes to train our machine learning algorithm.  The target function that we try to predict is whether the application will still be executing in an HMP after a given period.  After training on a large number of intervals, the learned model can predict what will happen in future intervals.


Methodology

The system works as follows. We count several important metrics of the processor and output the results every 10K cycles. Interesting per-core metrics include cache miss rates, the number of load/store instructions, and the number of instructions completed in the interval. For simplicity, we track the number of accesses to the L1 instruction and data caches, the number of L2 cache accesses, and the number of instructions executed in each time interval. We categorize each interval as an HMP interval or an LMP interval by applying a threshold to its number of L2 accesses.
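
To make the labeling step concrete, the following minimal sketch shows one way the intervals could be tagged and the prediction target (instructions remaining until the next LMP interval) derived from a trace. The array names, interval count, and threshold constant are illustrative placeholders rather than our simulator's actual output format.

    /*
     * Minimal sketch of the interval-labeling step, assuming a simple
     * in-memory trace.  NUM_INTERVALS, the array names, and the
     * threshold constant are illustrative placeholders.
     */
    #include <stdio.h>

    #define NUM_INTERVALS 1000      /* one entry per 10K-cycle interval     */
    #define L2_THRESHOLD  0.3       /* normalized L2-access threshold (HMP) */

    int main(void)
    {
        /* Per-interval metrics, filled from the execution trace. */
        double l2_accesses[NUM_INTERVALS] = {0};  /* normalized L2 accesses   */
        double insns[NUM_INTERVALS]       = {0};  /* instructions completed   */
        double label[NUM_INTERVALS]       = {0};  /* instructions to next LMP */

        /* ... load l2_accesses[] and insns[] from the trace here ... */

        /* Walk the trace backwards: an LMP interval resets the counter,
         * while an HMP interval accumulates the instructions that remain
         * before the application reaches its next LMP interval. */
        double to_go = 0.0;
        for (int i = NUM_INTERVALS - 1; i >= 0; i--) {
            if (l2_accesses[i] < L2_THRESHOLD)   /* LMP: below threshold    */
                to_go = 0.0;
            else                                 /* HMP: still memory-heavy */
                to_go += insns[i];
            label[i] = to_go;
        }

        printf("interval 0: %.0f instructions until next LMP\n", label[0]);
        return 0;
    }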

We implement the ANN predictor using the Fast Artificial Neural Network library (FANN).  The network is fully connected, with an input layer of 3 neurons, one hidden layer of 8 neurons, and an output layer of 1 neuron that produces the prediction.  We analyzed the data sets and decided which phases are high and low using a memory threshold.  This threshold is chosen so that we see the largest number of phase transitions, which makes our investigation more interesting.  From a system designer's point of view, the memory threshold should be chosen based on system characteristics and the memory behavior of other concurrent processes.  The data set for each benchmark is divided into a training set and a testing set, and we vary the sizes of these sets to study the effect of data set size on prediction accuracy.  We use Mean Squared Error (MSE) as the evaluation metric.  The number of instructions is normalized by 1 million instructions, so an MSE of 0.01 corresponds to a root-mean-squared error of 0.1, i.e., an average prediction error of about 100,000 instructions.
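
A minimal sketch of such a predictor built on FANN's C API is shown below. The 3-8-1 topology and the MSE evaluation follow the description above; the activation functions, file names, epoch counts, desired error, and sample input values are our own illustrative assumptions, not fixed parts of the design.

    /*
     * Sketch of the predictor using FANN.  The 3-8-1 topology matches
     * the network described above; file names, epoch counts, desired
     * error, and the sample input are illustrative assumptions.
     */
    #include <stdio.h>
    #include "fann.h"

    int main(void)
    {
        /* Fully connected net: 3 inputs (instructions executed, L1 icache
         * misses, L1 dcache misses), one hidden layer of 8 neurons, and
         * 1 output neuron carrying the prediction. */
        struct fann *ann = fann_create_standard(3, 3, 8, 1);

        fann_set_activation_function_hidden(ann, FANN_SIGMOID_SYMMETRIC);
        fann_set_activation_function_output(ann, FANN_LINEAR);

        /* Train on labeled intervals stored in FANN's plain-text format. */
        fann_train_on_file(ann, "train.data",
                           500,     /* max epochs             */
                           100,     /* epochs between reports */
                           0.001f); /* desired MSE            */

        /* Evaluate MSE on the held-out testing set. */
        struct fann_train_data *test = fann_read_train_from_file("test.data");
        printf("test MSE: %f\n", fann_test_data(ann, test));

        /* Predict instructions-to-LMP (in millions) for one new interval. */
        fann_type input[3] = {0.5f, 0.02f, 0.04f};
        fann_type *out = fann_run(ann, input);
        printf("predicted instructions to next LMP: %f million\n", out[0]);

        fann_destroy_train(test);
        fann_destroy(ann);
        return 0;
    }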


Results

We train with data sets from six SPEC CPU2000 applications.  Of the six benchmarks, only three (ammp, applu, and apsi) have large data sets, representing more interesting memory behavior.  The results show that we obtain very high accuracy when the training set is three times the size of the testing set.  When the training set is instead one third the size of the testing set, we still get a very good result on ammp, but very poor accuracy on apsi.

Benchmark   Memory Threshold   Number of Instances   MSE (3 Train : 1 Test)   MSE (1 Train : 3 Test)
ammp        0.3                41236                 0.065                    0.045
applu       0.3                205728                0.06                     0.21
apsi        0.3                86680                 0.23                     0.74
galgel      0.4                2290                  0                        0.03
lucas       0.4                2290                  0                        0.03
swim        0.9                6592                  0.03                     0
The above results show that a larger training set yields better prediction accuracy.  To further investigate the effect of data set size on prediction accuracy, we vary the number of training instances from 1/6 to the full number of instances used in the 3 Train : 1 Test case and run the predictor on two applications, applu and apsi.  The results show that prediction accuracy depends on the training data set and that some applications need a larger training set than others.


Conclusions

Our experiments show that an ANN can be used to give very high prediction accuracy for memory phase transitions in an application.  However, due to the requirement of a large data set and long training time, it is more suitable for static training and dynamic prediction.  For more detailed information, please refer to our paper.