optimal_performance.tex
6.05 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
\section{Performance Under An Inefficiency Budget}
\label{sec-opt-perf}
\begin{figure}[t]
\centering
\includegraphics[width=\columnwidth]{figures/plots/496/2d_best_point_variation_mulineff/gobmk_2d_stable_point_mulineff_cpi_mpki.pdf}
\vspace{-0.5em}
\caption{\textbf{Optimal Performance Point for \text{Gobmk} Across Inefficiencies:} At
low inefficiency budgets, the optimal frequency settings follow CPI of the
application, and select high memory frequencies for memory intensive phases with
high CPI.
%to deliver best
%performance under given inefficiency constraint.
Higher inefficiency budgets
allow the system to run always at the maximum CPU and memory frequencies.}
\label{gobmk-optimal}
\end{figure}
\begin{figure*}[t]
\begin{subfigure}[t]{\textwidth}
\includegraphics[width=\halfwidth]{{./figures/plots/496/cluster_of_best_points/cluster_of_stable_points_gobmk_1.0}.pdf}
\hspace{-0.5em}
\includegraphics[width=\halfwidth]{{./figures/plots/496/cluster_of_best_points/cluster_of_stable_points_gobmk_1.3}.pdf}
\end{subfigure}%
\vspace{0.5em}
\caption{\textbf{Performance Clusters for \textit{gobmk.}} With increase in
cluster threshold, the number of available frequency settings increase,
eventually leading to fewer transitions. Increase in cluster size with
inefficiency budget is a function of workload.}
\label{clusters-gobmk}
\end{figure*}
\begin{figure*}[t]
\begin{subfigure}[t]{\textwidth}
\includegraphics[width=\halfwidth]{{./figures/plots/496/cluster_of_best_points/cluster_of_stable_points_milc_1.0}.pdf}
\hspace{-0.5em}
\includegraphics[width=\halfwidth]{{./figures/plots/496/cluster_of_best_points/cluster_of_stable_points_milc_1.3}.pdf}
\end{subfigure}%
\vspace{0.5em}
\caption{\textbf{Performance Clusters of \textit{milc.}}
\textit{Milc} is CPU intensive to a large extent with some memory intensive
phases. At higher thresholds, while CPU frequency is tightly bound, performance
clusters cover a wide range of memory settings due to small performance
difference across these frequencies. }
%Thus with increasing inefficiency the clusters change in CPU frequency.
%With increase in
%cluster threshold, the number of available frequency settings settings increase.
%Which eventually leads to less transitions. %Decrease in transitions with
%increase in inefficiency budget is a function of benchmark characteristics.
%\XXXnote{Same text as Fig.~\ref{clusters-gobmk}. Perhaps move this into text body so it isn't
%duplicated?}}
\label{clusters-milc}
\end{figure*}
In this section we study the characteristics of the best performing CPU and
memory frequency settings, \textit{optimal settings}, across different
inefficiency constraints and how they change during application execution. To
find the optimal settings, we wrote a simple algorithm that first filters all
possible frequency settings under given inefficiency budget. It then finds the
CPU and memory frequency settings that result in highest speedup. In cases
where multiple settings result in similar speedup (within 0.5\%), to filter out
simulation noise, the algorithm selects the settings with highest CPU (first)
and memory frequency as this setting is bound to have highest performance among
the other possibilities.
Figure~\ref{gobmk-optimal} plots the optimal settings for Gobmk for all
benchmark samples (each of length 10 million instructions) across multiple
inefficiency constraints. At low inefficiencies, the optimal settings follow
the trends in CPI (cycles per instruction) and MPKI (misses per thousand
instructions). Regions of higher CPI correspond to memory intensive phases, as
the SPEC benchmarks don't have any IO or interrupt based portions.
%\XXXnote{possibly reword the next sentence-Dave}
%The higher the CPI is, the higher
%the memory frequency of the optimal settings is (sample 7) to serve high memory
%traffic.
For phases that are CPU intensive with (lower CPI), the optimal settings have
higher CPU frequency and lower memory frequency. % (sample 9 and 10). At low
At low inefficiency constraints, due to the limited energy budget, a careful
allocation of energy across components becomes critical to achieve optimal
performance. Higher inefficiencies allow the algorithms to select higher
frequency settings in order to achieve greater speedup. We define unconstrained
inefficiency (labeled $\infty$) as the scenario in which the algorithm always
chooses the highest frequency settings as these settings always deliver the
highest performance.
%---though may be wasting energy in some cases.
%Our algorithm can be
%improved to make trade-offs even at higher inefficiencies to not waste
%inefficiency, but we leave that to future work.
There are two key problems associated with tracking the optimal settings:
%MH: I liked this formatting better than the enumerated list
%\begin{enumerate}
%\item
\noindent \textit{It is expensive.} Running the tuning algorithm at the end of
every sample to track optimal settings comes at a cost: 1) searching and
discovering the optimal settings 2) real hardware has transition latency
overhead for both the CPU and the memory frequency.
%transitions in .
Reducing the frequency at which tuning algorithms need to re-tune is critical to
reduce the cost of tuning overhead on application performance.
%\item
\noindent \textit{Limited energy performance trade-off options.} Choosing the
optimal settings for every sample may hinder some energy-performance trade-off
that could have been made if performance was not so tightly bounded (to only
highest performance). For example, \textit{bzip2} is CPU bound and therefore
its performance at memory frequency of 200MHz is within 3\% of performance at a
memory frequency of 800MHz while CPU is running at 1000MHz. By sacrificing that
3\% of performance, the system could have consumed 1/4 the memory background
energy staying well under the given inefficiency budget.
%\end{enumerate}
We believe that, if the user is willing to sacrifice some performance under
given inefficiency budget, algorithms would be able to make better trade-offs
between the cost of frequent tuning and performance. %loss due to infrequent or
%non-optimal tuning.