A Quick Overview
In modern Chip Multiprocessors (CMPs), communication delay is a major source of
system performance bottlenecks. Applications switch between different
phases over the course of their execution. When an application is in a
high memory phase, the corresponding processor core accesses the shared
memory resources heavily. To use the shared memory resources
efficiently, a well-designed system should try to reduce the time during which
several applications are in high memory phases simultaneously. A
straightforward approach is to give high priority to applications that will
exit their high memory phase soon. In this report, we propose a
neural network based technique to predict the length of high memory
phases. In particular, our algorithm predicts the number of instructions
that will be executed before the application enters a low memory phase.
More details…
Motivation
Predicting when high and low memory phases occur during the execution
of multiple applications allows better scheduling that avoids conflicts in
the usage of shared memory resources.
Methodology
We use an Artificial Neural Network (ANN) to implement a predictor. The inputs to the predictor are
the number of instructions executed and the numbers of L1 instruction and data
cache misses. The output
is the number of instructions that will be executed before the next low memory
phase.
Results
Our experimental results show that we can obtain more than 90%
accuracy using an ANN if we have a large enough data set. The accuracy of the ANN in
predicting memory phases depends on the size of the training data
set. Therefore, the ANN
algorithm is suitable for static training and dynamic prediction.
Conclusions
Accurate but costly prediction.
Motivation
In modern Chip Multiprocessors (CMPs), communication delay, especially the
delay between the processor and main memory, is the bottleneck of
system performance. Complex memory hierarchies have been designed to feed
the processor cores with data. However, shared memory resources, such as
the lower-level caches and main memory bandwidth, are still the main source
of performance penalties in current CMPs. Since larger caches have longer
access delays, simply increasing the cache size does not always improve
system performance. A well-designed memory system should therefore optimize the
access efficiency of the shared memory resources.
In CMPs, several processor cores access the shared memory resources
simultaneously. In most cases, processes switch between phases with
different memory access patterns. In some phases, the hit rate of the
level 1 (L1) cache is very high, which means the process does not access
the shared L2 cache frequently. In other phases, the L1 hit rate is low
and the process accesses the L2 cache frequently. We call these two kinds
of phases Low Memory Phases (LMPs) and High Memory Phases (HMPs). System
performance is maximized when the HMPs of different processor cores do not
conflict with each other: a process in an HMP then enjoys a large share of
the cache and memory bandwidth. Conversely, system performance suffers when
the HMPs of different processor cores overlap.
Our task in this project is to use a machine learning algorithm to
predict which process will exit its HMP first. Specifically, we use a neural
network based algorithm to predict whether an application will exit its
high memory phase within a given time. We first collect execution traces
of different programs. The execution of an application is divided into
small time intervals, and in each interval we collect several metrics that
represent the memory usage of the application. These per-interval metrics
are used as attributes to train our machine learning algorithm. The target
function we try to predict is whether the application will still be
executing in an HMP after a given period. After training on a large
number of intervals, we can use the learned model to predict
what will happen in future intervals.
Methodology
The system works as follows. We count several important metrics of the
processor and output the results every 10k cycles. Interesting metrics in
the processor cores include cache miss rates, the numbers of load/store
instructions, and the number of instructions completed in the interval.
For simplicity, we keep track of the number of accesses to the L1 icache
and dcache, the number of L2 cache accesses, and the number of instructions
executed in each time interval. We categorize each interval as an HMP or
LMP interval by applying a threshold to the number of L2 accesses.
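As a concrete illustration of this labeling step, the sketch below (in C, since FANN, used later, is a C library) thresholds the per-interval L2 access count and derives the prediction target, the number of instructions until the next LMP interval. The struct layout, field names, and threshold value are illustrative assumptions, not the exact code used in our experiments.

#include <stddef.h>

#define L2_ACCESS_THRESHOLD 300UL   /* assumed threshold on L2 accesses per 10k-cycle interval */

struct interval {
    unsigned long l1_icache_accesses;
    unsigned long l1_dcache_accesses;
    unsigned long l2_accesses;
    unsigned long instructions;     /* instructions executed in this interval */
    int is_hmp;                     /* 1 = High Memory Phase, 0 = Low Memory Phase */
};

/* Classify each interval as HMP or LMP by thresholding its L2 access count. */
static void classify_intervals(struct interval *ivs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        ivs[i].is_hmp = (ivs[i].l2_accesses >= L2_ACCESS_THRESHOLD);
}

/* For each interval, compute the training target: the number of instructions
 * that will be executed before the application reaches the next LMP interval,
 * normalized to millions of instructions. */
static void compute_targets(const struct interval *ivs, double *target, size_t n)
{
    unsigned long remaining = 0;
    for (size_t i = n; i-- > 0; ) {
        if (!ivs[i].is_hmp)
            remaining = 0;                    /* already in a low memory phase */
        else
            remaining += ivs[i].instructions; /* still inside the high memory phase */
        target[i] = remaining / 1e6;
    }
}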
We implement an ANN predictor using the
Fast Artificial Neural Network Library (FANN). The network is fully connected, with an input layer of 3
neurons, one hidden layer of 8 neurons, and an output layer of 1 neuron,
which produces the prediction.
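A minimal sketch of how this 3-8-1 network can be constructed and trained with FANN follows; the file names, epoch counts, and desired error are placeholder assumptions, and only the topology comes from the description above.

#include <fann.h>

int main(void)
{
    /* Fully connected network: 3 layers, with 3 input, 8 hidden, and 1 output neurons. */
    struct fann *ann = fann_create_standard(3, 3, 8, 1);

    fann_set_activation_function_hidden(ann, FANN_SIGMOID_SYMMETRIC);
    fann_set_activation_function_output(ann, FANN_LINEAR);

    /* Each line of the training file holds one interval:
     * inputs  = instructions executed, L1 icache misses, L1 dcache misses (normalized)
     * output  = instructions until the next low memory phase (normalized)  */
    fann_train_on_file(ann, "ammp_train.data",
                       1000,    /* maximum epochs         */
                       100,     /* epochs between reports */
                       0.001f); /* desired MSE            */

    fann_save(ann, "hmp_predictor.net");
    fann_destroy(ann);
    return 0;
}

At prediction time, fann_run() is called with the three metrics of the current interval and returns the network's estimate for that interval.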
We analyze the data sets and decide which phases are high and low memory
using a memory threshold. This threshold is chosen so that we see the
largest number of transitions, which makes our investigation more
interesting. From a system designer's point of view, the memory threshold
should instead be chosen based on the system characteristics and the memory
behavior of other concurrent processes. The data set for each benchmark
is divided into a training set and a testing set. We vary the sizes of
these sets to study the effect of data set size on prediction accuracy.
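One way to realize such a split with FANN's training-data helpers is sketched below; the benchmark file name and the 3:1 proportion are only an example of the splits we use.

#include <fann.h>

/* Split one benchmark's interval data into a training and a testing subset (3:1 here). */
void split_data_example(void)
{
    struct fann_train_data *all = fann_read_train_from_file("apsi.data");
    unsigned int n = fann_length_train_data(all);
    unsigned int n_train = (3 * n) / 4;

    struct fann_train_data *train = fann_subset_train_data(all, 0, n_train);
    struct fann_train_data *test  = fann_subset_train_data(all, n_train, n - n_train);

    /* ... train on `train`, evaluate on `test` ... */

    fann_destroy_train(train);
    fann_destroy_train(test);
    fann_destroy_train(all);
}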
We use the Mean Squared Error (MSE) as the metric to evaluate the
algorithm. The number of instructions is normalized to millions of
instructions, so an MSE of 0.1 corresponds to a root-mean-square prediction
error of roughly 0.32 million (about 320,000) instructions.
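For example, with FANN the test-set MSE can be obtained directly by running the trained network over the held-out data; the file names below are placeholders.

#include <stdio.h>
#include <fann.h>

int main(void)
{
    struct fann *ann = fann_create_from_file("hmp_predictor.net");
    struct fann_train_data *test = fann_read_train_from_file("ammp_test.data");

    /* fann_test_data() runs every test pattern through the network
     * and returns the mean squared error over the whole set.        */
    float mse = fann_test_data(ann, test);
    printf("test MSE = %f (in normalized instruction counts)\n", mse);

    fann_destroy_train(test);
    fann_destroy(ann);
    return 0;
}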
Results
We train with data sets from 6 SPEC CPU2000 applications. Of the 6
benchmarks, only 3 (ammp, applu, and apsi) have large data sets,
representing more interesting memory behavior.
The results show that we obtain very high accuracy when the
training set is three times the size of the testing set. When the training
set is 3 times smaller than the testing set, we still get very good results
on ammp, but very poor accuracy on apsi.
Benchmark | Memory Threshold | Number of Instances | MSE (3 Train : 1 Test) | MSE (1 Train : 3 Test)
ammp      | 0.3              | 41236               | 0.065                  | 0.045
applu     | 0.3              | 205728              | 0.06                   | 0.21
apsi      | 0.3              | 86680               | 0.23                   | 0.74
galgel    | 0.4              | 2290                | 0                      | 0.03
lucas     | 0.4              | 2290                | 0                      | 0.03
swim      | 0.9              | 6592                | 0.03                   | 0
The above results show that with a larger training set, we obtain better
prediction accuracy. To further investigate the effect of data set size on
prediction accuracy, we vary the number of training instances from 1/6 of
to the full number of instances used in the 3 Train : 1 Test case, and run
the predictor on two applications, applu and apsi.
The graph shows that prediction accuracy depends on the size of the
training data set and that some applications may need a larger training set
than others.
Conclusions
Our experiments show that an ANN can give very high prediction accuracy
for memory phase transitions in an application. However, due to the
requirement of a large data set and long training time, it is more suitable
for static training and dynamic prediction. For more detailed information,
please refer to our paper.