\section{Performance Clusters}
\label{sec-perf-clusters}
Tracking the best-performing settings for a given inefficiency budget is
expensive. In this section, we study how to amortize that cost by trading off
some performance. We define the concept of \textit{performance clusters}:
all frequency settings (CPU and memory frequency pairs) whose performance is
within a performance degradation threshold (the \textit{cluster threshold})
of the performance of the optimal settings for a given inefficiency budget
form the \textit{performance cluster} for that inefficiency constraint. We
also define \textit{stable regions} as regions in which at least one pair of
CPU and memory frequency settings is common among all samples in the region.
We first study the trends in performance clusters for multiple
applications. We then characterize the stable regions and explore the
implications of using stable regions for energy-performance trade-offs across
multiple inefficiency budgets and cluster thresholds. Finally, we study the
sensitivity of performance clusters to the number of frequency settings available in
the system.
\subsection{Performance Clusters}
We search for the performance clusters using an algorithm similar to the approach we used to find the optimal settings. In the
first pass, we filter the settings that fall within a given inefficiency budget and
search for the optimal settings among them. In the second pass, we find all of the
settings whose speedup is within the specified cluster threshold of the optimal performance.
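The two-pass search can be sketched as follows (a minimal Python sketch; the per-setting speedup and inefficiency values below are hypothetical placeholders for measured data, not numbers from our experiments):

```python
def performance_cluster(settings, budget, threshold):
    """Two-pass search for the performance cluster of one sample."""
    # Pass 1: keep only settings within the inefficiency budget and
    # find the best-performing (optimal) setting among them.
    feasible = [s for s in settings if s["inefficiency"] <= budget]
    optimal = max(feasible, key=lambda s: s["speedup"])
    # Pass 2: the cluster is every feasible setting whose performance
    # degradation relative to the optimal is within the threshold.
    return [s for s in feasible
            if s["speedup"] >= optimal["speedup"] * (1.0 - threshold)]

# Hypothetical measurements for one sample: (CPU MHz, memory MHz)
# pairs with their observed speedup and inefficiency.
settings = [
    {"cpu": 920, "mem": 580, "speedup": 0.950, "inefficiency": 1.28},
    {"cpu": 900, "mem": 420, "speedup": 0.945, "inefficiency": 1.15},
    {"cpu": 750, "mem": 300, "speedup": 0.800, "inefficiency": 1.05},
    {"cpu": 920, "mem": 800, "speedup": 0.960, "inefficiency": 1.45},
]
cluster = performance_cluster(settings, budget=1.3, threshold=0.01)
# The 1.45-inefficiency setting is filtered out in pass 1; the cluster
# holds the two settings within 1% of the best feasible speedup.
```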
Figures~\ref{clusters-gobmk} and~\ref{clusters-milc} plot the performance
clusters during the execution of the benchmarks \textit{gobmk} and \textit{milc}. We
plot inefficiency budgets of 1 and 1.3 and cluster thresholds of 1\% and 5\%. For
our benchmarks, we observed that the maximum achievable inefficiency ranges from 1.5 to 2. We
chose inefficiency budgets of 1 and 1.3 to cover low and mid inefficiency
budgets. Cluster thresholds of 1\% and
5\% allow us to model the two extremes of tolerable performance degradation bounds:
a cluster threshold of less than 1\% may limit the ability to tune less often,
while cluster thresholds greater than 5\% are probably not realistic, as the user is already
compromising performance by setting low inefficiency budgets to save energy.
Figures~\ref{clusters-gobmk}(c) and~\ref{clusters-gobmk}(d) plot the
performance clusters for \textit{gobmk} for an inefficiency budget of 1.3 and
cluster thresholds of 1\% and 5\%, respectively. As we observed in
Figure~\ref{gobmk-optimal}, the optimal settings for \textit{gobmk} change
every sample (of length 10 million instructions) at an inefficiency budget of 1.3 and follow
application phases (CPI). Figure~\ref{clusters-gobmk}(c) shows that by
allowing just 1\% performance degradation, the number of settings
available to choose from increases. For example, for sample 11, the
optimal settings were 920MHz CPU and 580MHz memory. With a 1\%
cluster threshold, the range of available frequencies widens to
900MHz-920MHz for the CPU and 420MHz-580MHz for memory. With a 5\%
cluster threshold, the range of available frequencies widens
further, as shown in Figure~\ref{clusters-gobmk}(d). With an increase in the number of available settings, the
probability of finding common settings in two consecutive samples
increases, allowing the system to stay at one frequency setting for a
longer time. For example, the optimal settings changed between samples
24 and 25; however, with a cluster threshold of 5\%, the CPU and memory
frequencies can stay fixed at 750MHz and 800MHz, respectively. The higher
the cluster threshold, the longer the stable regions.
Figures~\ref{clusters-gobmk}(a) and~\ref{clusters-gobmk}(c) plot the performance
clusters for \textit{gobmk} for two different inefficiency budgets, 1.0 and 1.3, at a
cluster threshold of 1\%.
Not all of the stable regions grow in length with increasing inefficiency;
this trend varies with workloads.
If consecutive
samples of a workload have a small difference in performance but differ significantly in energy
consumption, then only at
higher inefficiency budgets will the system find common settings for these
consecutive samples. This is because
the performance clusters of higher inefficiencies can include settings operating at
lower inefficiencies, as long as their performance degradation is within the
cluster threshold. For example, the memory frequency oscillates for samples
32-39 of \textit{gobmk}
at an inefficiency budget of 1.0, while the system could stay fixed at 800MHz memory at an inefficiency budget
of 1.3. However, for workload phases that result in a high performance difference
between consecutive samples at a given pair of frequency settings, higher inefficiency
budgets might not help, as there might not be any common frequency pairs whose
performance is within the set cluster threshold (for example, samples 3-5 in
Figures~\ref{clusters-gobmk}(a) and~\ref{clusters-gobmk}(c)).
Figure~\ref{clusters-milc} shows that \textit{milc} exhibits trends similar to
\textit{gobmk}.
\begin{figure}[t]
\centering
\includegraphics[width=\columnwidth]{./figures/plots/496/stable_line_plots/lbm_stable_lineplot_annotated_5.pdf}
\vspace{-0.5em}
\caption{\textbf{Stable Regions and Transitions for \textit{lbm} with
Threshold of 5\% and Inefficiency Budget of 1.3:} Solid lines represent the
stable regions and vertical dashed lines mark the transitions made by
\textit{lbm}.}
\label{lbm-stable-line-5-annotated}
\end{figure}
\begin{figure*}[t]
\begin{subfigure}[t]{\textwidth}
\centering
\vspace{-1em}
\includegraphics[width=\columnwidth]{{./figures/plots/496/stable_line_plots/stable_lineplot}.pdf}
\end{subfigure}%
\vspace{0.5em}
\caption{\textbf{Stable Regions of \textit{gcc} and \textit{lbm} for
Inefficiency Budget of 1.3:} An increase in cluster threshold increases the length
of the stable regions, which eventually leads to fewer transitions. Higher
inefficiency budgets allow the system to run unconstrained throughout.}
\label{stable-regions}
\end{figure*}
An interesting observation from the performance clusters is that algorithms
like CoScale~\cite{deng2012coscale}, which search for the best-performing settings every interval starting
from the maximum frequency settings, are not efficient. Algorithms can reduce the
overhead of the optimal-settings search by starting the search from the settings selected
for the previous interval, as application phases are often stable across multiple sample intervals.
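This warm-start idea can be illustrated with a short sketch (hypothetical, not CoScale's actual algorithm): probe settings outward from the previous interval's choice rather than restarting from the maximum frequencies.

```python
def warm_start_search(settings, prev_idx, acceptable):
    """Probe settings in order of distance from the previous interval's
    choice and return the index of the first acceptable one.
    `acceptable` stands in for whatever feasibility test the tuner
    applies (hypothetical placeholder)."""
    order = sorted(range(len(settings)), key=lambda i: abs(i - prev_idx))
    for i in order:
        if acceptable(settings[i]):
            return i
    return prev_idx  # fall back to the previous setting

# With stable phases, the previous setting (or a near neighbor) is
# usually still acceptable, so the search terminates after few probes.
cpu_freqs = [300, 400, 500, 600, 700]          # MHz, hypothetical
chosen = warm_start_search(cpu_freqs, 2, lambda f: f >= 600)
```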
\begin{figure*}[t]
\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\columnwidth]{{./figures/plots/496/stability_metrics/transition_percent_across_thresholds}.pdf}
\end{subfigure}
\vspace{-0.5em}
\caption{\textbf{Number of Transitions with Varying Inefficiency Budgets and
Cluster Thresholds:} The number of frequency transitions decreases with an increase in cluster
threshold. The amount of change varies with benchmark and inefficiency budget.}
\label{transitions-bar}
\end{figure*}
\begin{figure*}[t]
\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\columnwidth]{{./figures/plots/496/stable_length_box/stable_length_box}.pdf}
\end{subfigure}%
\vspace{0.5em}
\caption{\textbf{Distribution of Length of Stable Regions:} The average length of
stable regions increases with cluster threshold.
%While inefficiency budget has significant impact on length of stable regions
%for \textit{lbm}, it doesn't change the length of stable regions of
%\textit{gobmk} much.
}
\label{box-lengths}
\end{figure*}
\subsection{Stable Regions}
So far, we have made observations by
looking at CPU and memory frequencies separately and finding (visually) where
they \textit{both} stay stable. However, when either the memory or the CPU performance cluster moves,
the system needs to make a transition. Looking at plots of individual components does not provide a clear picture of the stable regions of the entire CPU and memory system.
Figures~\ref{clusters-gobmk} and~\ref{clusters-milc} plot the
performance clusters, not the stable regions.
We wrote an algorithm to find all of the stable regions for a given application. It starts
by computing the performance cluster of the first sample and moves ahead sample by
sample. For every sample, it computes the available settings by intersecting the
current sample's performance cluster with the
settings available up to the previous sample. When the algorithm finds no more common
settings, it marks the end of the stable region. If more than one frequency pair
exists in the available settings for the region, the algorithm chooses the
setting with the highest CPU frequency (first) and then the highest memory frequency as the optimal settings for the
region. Figure~\ref{lbm-stable-line-5-annotated} shows the CPU and memory frequency
settings selected for the stable regions of the benchmark \textit{lbm}, with
markers indicating the end of each stable region. Note that within
every stable region (between any two markers) the frequencies of both CPU and memory stay
constant.
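Our stable-region search can be sketched as follows (a minimal Python sketch; the input format, one set of (CPU, memory) frequency pairs per sample, is an assumption about how the precomputed performance clusters are represented):

```python
def stable_regions(clusters_per_sample):
    """Split a run into stable regions: maximal spans of consecutive
    samples sharing at least one common (cpu_mhz, mem_mhz) pair.
    Returns (first_sample, last_sample, chosen_setting) triples."""
    regions = []
    available, start = None, 0
    for i, cluster in enumerate(clusters_per_sample):
        if available is None:
            available, start = set(cluster), i
        elif available & cluster:
            available &= cluster        # the region continues
        else:
            # No common setting left: close the region, choosing the
            # pair with the highest CPU (then memory) frequency.
            regions.append((start, i - 1, max(available)))
            available, start = set(cluster), i
    if available is not None:
        regions.append((start, len(clusters_per_sample) - 1, max(available)))
    return regions

# Hypothetical three-sample run: samples 0-1 share (900, 400), then
# the cluster moves and forces a transition at sample 2.
regions = stable_regions([
    {(900, 400), (920, 580)},
    {(900, 400), (750, 300)},
    {(750, 300)},
])
```

Because Python compares tuples element by element, `max(available)` implements the "highest CPU frequency first, then highest memory frequency" tie-break directly.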
Our algorithm is not practical for real systems, as it knows the characteristics of
future samples and their performance clusters at the beginning of a stable
region. As future work, we are
designing hardware and software algorithms capable of tuning the system while
the application runs. In Section~\ref{sec-algo-implications}, we
propose ways in which the length of stable regions and the available settings for a
given region can be predicted by energy management algorithms in real systems.
Figure~\ref{stable-regions} plots stable regions for the benchmarks \textit{gcc} and
\textit{lbm} for multiple inefficiency budgets and cluster thresholds. With
an increase in cluster threshold from 3\% to 5\%, there is a significant drop in the
number of transitions made by \textit{gcc} at lower inefficiency budgets. At
higher inefficiency budgets, algorithms can choose the highest available
frequency settings, and therefore, even if a higher cluster threshold is
allowed, we do not observe any changes in the selection of settings. The relative
number of transitions made by \textit{lbm} decreases with an increase in cluster
threshold; however, its absolute number of transitions compared to other
benchmarks does not decrease significantly, as it does not have many
transitions to start with at 3\%. Consistent with our previous observation, the number of
transitions also decreases with increasing inefficiency for these two
benchmarks. This shows that a large number of consecutive samples have
similar performance but different inefficiency at the same CPU and memory frequency
settings. The decrease in the number of transitions is a result of an increase in
the length of the stable regions.
Figure~\ref{transitions-bar} summarizes the number of transitions per
billion instructions for multiple cluster thresholds and inefficiency budgets across
benchmarks. As the figure shows, tracking the optimal frequency settings results
in the highest number of transitions. A common observation is that the number of transitions required decreases with an
increase in cluster threshold. For \textit{bzip2}, increasing the inefficiency budget from
1.0 to 1.3 increases the number
of transitions needed to track the optimal settings. The number of
available settings increases with inefficiency, increasing the average length of
stable regions. At an inefficiency budget of 1.6, the average length of a stable region
increases drastically, as shown in Figure~\ref{box-lengths}(b), which requires far fewer
transitions with a 1\% cluster threshold and no transitions with the higher cluster thresholds of 3\%
and 5\%. Note that there is only one point on the box plot for the 3\% and 5\%
cluster thresholds at an inefficiency budget of 1.6, because the benchmark is covered entirely by a single region.
In contrast, \textit{gobmk} has rapidly changing phases, and
therefore, with an
increase in either the inefficiency budget or the cluster threshold, we do not observe much of an increase
in stable region lengths, as shown in Figure~\ref{box-lengths}(a).
Consequently, the number of transitions per billion instructions decreases only slightly
with an increase in cluster threshold and inefficiency budget for \textit{gobmk}.
Figure~\ref{box-lengths}(c) summarizes the distribution of stable region lengths observed
across benchmarks for multiple cluster thresholds at an inefficiency budget of 1.3.
\begin{figure}[t]
% \begin{subfigure}[t]{0.45\textwidth}
% \centering
% \includegraphics[width=\columnwidth]{{./figures/plots/496/energy_perf_bar/energy_bar_normalized_0.0_0_0}.pdf}
% \vspace{-2.5em}
% \end{subfigure}%
% \begin{subfigure}[t]{0.45\textwidth}
\centering
\includegraphics[width=\columnwidth]{{./figures/plots/496/energy_perf_bar/performance_bar_normalized_0.0_0_0}.pdf}
% \vspace{-2.5em}
% \end{subfigure}%
\caption{\textbf{Variation of Performance with Inefficiency:} Performance
improves with an increase in inefficiency budget, but the amount of improvement varies across benchmarks.
}
\label{ineff-perf}
\end{figure}
\begin{figure*}[t]
\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\columnwidth]{{./figures/plots/496/energy_perf_bar/energy_perf_bar_1.3}.pdf}
\end{subfigure}
\vspace{-0.5em}
\caption{\textbf{Energy-performance Trade-offs for Inefficiency Budget of 1.3,
Multiple Cluster Thresholds:} Performance degradation is always within the
cluster threshold. Allowing small degradation in performance reduces energy
consumption, which decreases further when tuning overhead is included.}
\label{energy-perf-trade-off}
\end{figure*}
\subsection{Energy-Performance Trade-offs}
In this subsection, we analyze the energy-performance trade-offs made by our
ideal algorithm. We then add the tuning cost of our algorithm and compare the
energy-performance trade-offs across multiple applications. We study multiple
cluster thresholds and an inefficiency budget of 1.3.
First, we demonstrate that our tuning algorithm succeeds in selecting the right
settings, keeping the system under the specified inefficiency
budget, and we summarize the total performance achieved.
We ran a set of simulations and verified that all applications stayed within
the given inefficiency budget across all of the inefficiency budgets we studied.
%Applications
%don't consume all of the specified budget for two reasons. First, as the
%available frequency settings space is not continuous, it is possible that there
%are no settings available that completely exhaust the given inefficiency budget,
%so the algorithm ends up choosing settings that consume less energy.
%Second,
%given that $I_{max}$ is bounded, there is a upper limit on how much of the
%energy can be consumed.
Figure~\ref{ineff-perf} shows that the higher the
inefficiency budget, the lower the execution time, enabling smooth
energy-performance trade-offs.
Figure~\ref{energy-perf-trade-off} plots the total performance degradation and
energy savings for multiple cluster thresholds, with and without tuning overhead,
for an inefficiency budget of 1.3. Both performance degradation and energy savings
are computed relative to the performance and energy of the application running
at the optimal settings. Figures~\ref{energy-perf-trade-off}(a)
and~\ref{energy-perf-trade-off}(b) show that our algorithm succeeds in
selecting settings that do not degrade performance by more than the specified
cluster threshold. The figure also shows that with an increase in cluster
threshold, energy consumption decreases, because lower frequency settings can be
chosen at higher cluster thresholds. Figure~\ref{energy-perf-trade-off}(b) shows
that performance (and energy) may improve when tuning overhead is included, due
to the decrease in frequency transitions. To determine the tuning overhead, we wrote a
simple algorithm to find the optimal settings. With a search space of 70 frequency
settings, it incurred a tuning overhead of 500$\mu$s and 30$\mu$J, which includes computing inefficiencies, searching for the optimal setting,
and transitioning the hardware to the new settings.
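The interaction between cluster threshold and overhead can be seen with simple arithmetic (a hypothetical sketch using the per-tuning-event cost measured above; the event counts are illustrative, not measured):

```python
def tuning_overhead(tuning_events, time_s=500e-6, energy_j=30e-6):
    """Total tuning overhead, assuming each tuning event costs about
    500us of time and 30uJ of energy (the overhead measured above)."""
    return tuning_events * time_s, tuning_events * energy_j

# Tracking the optimal settings every sample vs. tuning only at stable
# region boundaries with a higher threshold (illustrative counts):
t_opt, e_opt = tuning_overhead(1000)
t_thr, e_thr = tuning_overhead(50)
# Fewer transitions directly shrink both time and energy overhead,
# which is why a higher threshold can improve overall performance.
```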
We summarize our observations from this section here:
\begin{enumerate}
\item With an increase in cluster threshold, the range of available frequencies increases,
which increases the probability of finding common settings in consecutive
samples. This results in longer stable regions.
\item The increase in the length of stable regions with an increase in inefficiency budget depends on the workload.
\item The number of transitions required is dictated by the average length of the
stable regions: the longer the stable regions, the fewer
transitions the system needs to make.
\item Allowing a higher degradation in performance may, in fact, result in improved
performance when tuning overhead is included, due to the reduction in the
number of frequency transitions in the system; consequently, energy savings also
increase.
\end{enumerate}
\subsection{Sensitivity of Performance Clusters to Frequency Step Size}
In this section, we study the sensitivity of the performance clusters to the number
of frequency steps (i.e., the frequency step size) available in a given system. We computed performance clusters offline
and analyzed the difference between clusters computed with coarse frequency steps and clusters computed with finer frequency steps.
Figure~\ref{sensitivity} plots performance clusters for \textit{gobmk} at
an inefficiency budget of 1.3 and a cluster threshold of 1\%. We chose 1\%
for our sensitivity analysis, as the trends in performance clusters are more pronounced
at low cluster thresholds.
Figure~\ref{sensitivity}(a) plots the clusters collected with a 100MHz
frequency step for both the CPU and the memory, for a total of 70 possible
settings. The clusters in
Figure~\ref{sensitivity}(b) are collected with 30MHz steps for the CPU frequency and
40MHz steps for the memory frequency, for a total of 496 settings. We observed that the average
cluster length either remains the same or decreases with an increase in the number of
steps. With more frequency steps, there are more choices
available to make better energy-performance trade-offs; therefore, the average number
of samples for which one setting can be chosen decreases. For example, with 70
frequency settings, samples 7 through 10 can always run at a CPU frequency of 900MHz
and a memory frequency of 300MHz. With 496 frequency settings, sample 7
runs at one setting, samples 8-9 run at another setting, and sample 10 runs at yet another
setting due to the availability of more (and better) choices.
In our system, we observed only a small improvement in performance (\textless
1\%) with an increased number of frequency steps when
tuning is free, as the optimal
settings in both cases were off by only a few MHz. It is the balance between the
tuning overhead and the energy-performance savings that is
critical in deciding the correct size of the search space.
\begin{figure}[t]
\begin{subfigure}[t]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{{./figures/plots/496/cluster_of_best_points/cluster_of_stable_points_freq_sens_gobmk}.pdf}
\vspace{-1em}
\end{subfigure}%
\vspace{0.5em}
\caption{\textbf{Performance Clusters at Two Different Frequency Steps}:
Figure~(a) plots performance clusters collected using a 100MHz frequency step for both CPU and
memory. Figure~(b) plots
performance clusters collected using frequency steps of 30MHz for the CPU and 40MHz
for the memory.}
\label{sensitivity}
\end{figure}