Commit 31e336b9a5e630ce73377190ccaf33171ea8b5da

Authored by Rizwana Begum
1 parent 5611d5db

draft.

acknowledgement.tex
1 1 \section{Acknowledgement}
2   -This material is based on work partially supported by NSF Collaborative Awards
  2 +This material is based on work partially supported by NSF Awards
3 3 CSR-1409014 and CSR-1409367. Any opinions, findings, and conclusions or
4 4 recommendations expressed in this material are those of the authors and do not
5 5 necessarily reflect the views of the National Science Foundation.
... ...
Binary figure PDFs (no preview):
figures/plots/496/2d_best_point_variation_mulineff/gobmk_2d_stable_point_mulineff.pdf
figures/plots/496/2d_best_point_variation_mulineff/gobmk_2d_stable_point_mulineff_cpi_mpki.pdf
figures/plots/496/energy_perf_bar/energy_bar_normalized_0.0_0_0.pdf
figures/plots/496/energy_perf_bar/energy_bar_normalized_1.0_0_0.pdf
figures/plots/496/energy_perf_bar/energy_bar_normalized_5.0_0_0.pdf
figures/plots/496/energy_perf_bar/energy_perf_bar_1.3.pdf
figures/plots/496/energy_perf_bar/performance_bar_normalized_0.0_0_0.pdf
figures/plots/496/energy_perf_bar/performance_bar_normalized_1.0_0_0.pdf
figures/plots/496/energy_perf_bar/performance_bar_normalized_5.0_0_0.pdf
figures/plots/496/speedup_inefficiency/heatmap_inefficiency.pdf
figures/plots/496/stable_length_box/stable_length_box.pdf
figures/plots/496/stable_line_plots/lbm_stable_lineplot_annotated_5.pdf
figures/plots/496/stable_line_plots/stable_lineplot.pdf
inefficiency.tex
... ... @@ -121,11 +121,11 @@ both the energy ($E$) consumed by the application and the minimum energy
121 121 %
122 122 Computing $E$ is straightforward; the
123 123 Intel Sandy Bridge architecture~\cite{sandy-bridge-sw-manual} already
124   -provides performance counters capable of measuring energy consumption at
  124 +provides counters capable of measuring energy consumption at
125 125 runtime and the research community has
126 126 tools and models to estimate the absolute energy of applications~\cite{brooks2000wattch,drampower-tool,li2009mcpat,micronpowercalc-lpddr3-url,wilton1996cacti}.
127 127  
128   -Computing $E_{min}$ is challenging because of the inter-component
  128 +Computing $E_{min}$ is challenging due to inter-component
129 129 dependencies.
130 130 %
131 131 We propose two methods for computing $E_{min}$:
... ... @@ -151,36 +151,37 @@ We propose two methods for computing $E_{min}$:
151 151  
152 152 \end{itemize}
153 153  
154   -%We are working towards designing efficient energy prediction models for CPU,
155   -%memory and network components.
  154 +We are working towards designing efficient energy prediction models for CPU and
  155 +memory.
156 156 %
157   -%Our models consider cross-component interactions on performance and energy
158   -%consumption.
  157 +Our models account for the effect of cross-component interactions on
  158 +performance and energy consumption.
159 159 %
160 160 %%%%%%%% MODEL %%%%%%%%%%
161   -We designed efficient models to predict performance and energy consumption of
162   -CPU and memory at various voltage and frequency settings for a given
163   -application. We plan on using these models to estimate $E_{min}$ of a given set
164   -of instructions.
165   -%We envision a system capable of scaling voltage and frequency of CPU and only
166   -%frequency of DRAM.
167   -Our models consider cross-component interactions on performance and energy.
168   -The performance model uses hardware performance counters to measure amount of time
169   -each component is $Busy$ completing the work, $Idle$ stalled on the other
170   -component and $Waiting$ for more work. We designed systematic methodology to
171   -scale these states to estimate execution time of a given workload at different
172   -voltage and frequency settings. In our model, the $Idle$ time of one component
173   -depends on the settings of the second component. The $Busy$ time of each
174   -component scales with it's own frequency. However, part of the $Busy$ time that
175   -overlaps with the other component is constrained by the slowest component.
176   -
177   -We combine predicted performance with the power models of CPU and memory
178   -described in Section~\ref{subsec-energy-models} to estimate energy consumption
179   -of CPU and memory. Our model has average prediction error of 4\% across SPEC
180   -CPU2006 benchmarks with highest error of 10\% except for $gobmk (18\%)$ and $lbm
181   -(24\%)$. In this work we demonstrate how to use inefficiency, deferring
182   -optimization of $E_{min}$ prediction to future work.
  161 +%We designed efficient models to predict performance and energy consumption of
  162 +%CPU and memory at various voltage and frequency settings for a given
  163 +%application. We plan on using these models to estimate $E_{min}$ of a given set
  164 +%of instructions.
  165 +%%We envision a system capable of scaling voltage and frequency of CPU and only
  166 +%%frequency of DRAM.
  167 +%Our models consider cross-component interactions on performance and energy.
  168 +%The performance model uses hardware performance counters to measure amount of time
  169 +%each component is $Busy$ completing the work, $Idle$ stalled on the other
  170 +%component and $Waiting$ for more work. We designed systematic methodology to
  171 +%scale these states to estimate execution time of a given workload at different
  172 +%voltage and frequency settings. In our model, the $Idle$ time of one component
  173 +%depends on the settings of the second component. The $Busy$ time of each
  174 +%component scales with it's own frequency. However, part of the $Busy$ time that
  175 +%overlaps with the other component is constrained by the slowest component.
  176 +%
  177 +%We combine predicted performance with the power models of CPU and memory
  178 +%described in Section~\ref{subsec-energy-models} to estimate energy consumption
  179 +%of CPU and memory. Our model has average prediction error of 4\% across SPEC
  180 +%CPU2006 benchmarks with highest error of 10\% except for $gobmk (18\%)$ and $lbm
  181 +%(24\%)$.
183 182 %%%%% END OF MODEL %%%%%%
  183 +In this work, we demonstrate how to use inefficiency, deferring the prediction
  184 +and optimization of $E_{min}$ to future work.
184 185  
185 186 \subsection{Managing Inefficiency}
186 187 %
... ...
inefficiency_speedup.tex
... ... @@ -17,8 +17,8 @@ frequency settings may burn extra energy without improving performance.
17 17  
18 18 We performed offline analysis of the data collected from our simulations to
19 19 study the inefficiency-performance trends for various benchmarks. With a brute
20   -force search, we found $E_{min}$ and computed inefficiency at all frequency
21   -settings. We express performance in terms of $speedup$, the ratio of execution
  20 +force search, we found $E_{min}$ and computed inefficiency at all
  21 +settings. We express performance in terms of $speedup$, the ratio of the longest
22 22 execution time to the execution time for a given configuration.
23 23 % to the execution time at
24 24 %a given frequency setting.
... ... @@ -66,7 +66,7 @@ example, \textit{gobmk} runs 1.5x slower if it is forced to run at budget of
66 66 the inefficiency constraint and \textbf{not} just \textbf{at} the inefficiency
67 67 constraint.} Algorithms forcing the system to run exactly at given budget might end
68 68 up wasting energy or, even worse, degrading performance. A smart algorithm should
69   -a) use no more than given inefficiency budget b) should use only as much
  69 +a) stay under the given inefficiency budget, b) use only as much
70 70 inefficiency budget as needed, and c) deliver the best performance.
71 71 %\end{enumerate}
72 72  
... ...
introduction.tex
... ... @@ -19,7 +19,8 @@ Still other hardware energy-performance tradeoffs are on the horizon, arising
19 19 from capabilities such as memory frequency scaling~\cite{david2011memory} and nanosecond-speed DVFS
20 20 emerging in next-generation hardware designs~\cite{6084810}.
21 21  
22   -We envision a next-generation smartphone capable of both CPU and memory DVFS.
  22 +We envision a next-generation smartphone capable of scaling both the voltage and
  23 +frequency of the CPU and only the frequency of memory.
23 24 %
24 25 While the addition of memory DVFS can be used to improve energy-constrained
25 26 performance, the larger frequency state space compared to CPU DVFS alone also
... ...
optimal_performance.tex
... ... @@ -7,8 +7,8 @@
7 7 \vspace{-0.5em}
8 8 \caption{\textbf{Optimal Performance Point for \text{Gobmk} Across Inefficiencies:} At
9 9 low inefficiency budgets, the optimal frequency settings follow the CPI of the
10   -application, and select high memory frequencies for memory intensive phases with
11   -high CPI.
  10 +application, and select high memory frequencies for memory intensive phases. % with
  11 +%high CPI.
12 12 %to deliver best
13 13 %performance under given inefficiency constraint.
14 14 Higher inefficiency budgets
... ... @@ -62,7 +62,7 @@ and then memory frequency as this setting is bound to have highest performance a
62 62 the other possibilities.
63 63  
64 64 Figure~\ref{gobmk-optimal} plots the optimal settings for $gobmk$ for all
65   -benchmark samples (each of length 10 million instructions) across multiple
  65 +benchmark samples (each of length 10M instructions) across multiple
66 66 inefficiency constraints. At low inefficiencies, the optimal settings follow
67 67 the trends in CPI (cycles per instruction) and MPKI (misses per thousand
68 68 instructions). Regions of higher CPI correspond to memory intensive phases, as
... ... @@ -71,7 +71,7 @@ the SPEC benchmarks don't have any IO or interrupt based portions.
71 71 %The higher the CPI is, the higher
72 72 %the memory frequency of the optimal settings is (sample 7) to serve high memory
73 73 %traffic.
74   -For phases that are CPU intensive with (lower CPI), the optimal settings have
  74 +For phases that are CPU intensive (lower CPI), the optimal settings have
75 75 higher CPU frequency and lower memory frequency. % (sample 9 and 10). At low
76 76 At low inefficiency constraints, due to the limited energy budget, a careful
77 77 allocation of energy across components becomes critical to achieve optimal
... ... @@ -92,8 +92,10 @@ There are two key problems associated with tracking the optimal settings:
92 92 \noindent \textit{It is expensive.} Running the tuning algorithm at the end of
93 93 every sample to track optimal settings comes at a cost: 1) searching and
94 94 discovering the optimal settings 2) real hardware has transition latency
95   -overhead for both the CPU and the memory frequency.
96   -%transitions in .
  95 +overhead for both the CPU and the memory frequency. For example, while the search
  96 +algorithm presented by CoScale~\cite{deng2012coscale} takes 5us to find the optimal
  97 +frequency settings, the time taken by PLLs to change voltage and frequency in
  98 +commercial processors is on the order of tens of microseconds.
97 99 Reducing the frequency at which tuning algorithms need to re-tune is critical to
98 100 reduce the cost of tuning overhead on application performance.
99 101  
... ... @@ -105,7 +107,7 @@ highest performance). For example, \textit{bzip2} is CPU bound and therefore
105 107 its performance at a memory frequency of 200MHz is within 3\% of its performance at a
106 108 memory frequency of 800MHz while the CPU is running at 1000MHz. By sacrificing that
107 109 3\% of performance, the system could have consumed 1/4 the memory background
108   -energy staying well under the given inefficiency budget.
  110 +energy, saving 2.7\% of the system energy while staying well under the given inefficiency budget.
109 111 %\end{enumerate}
110 112  
111 113 We believe that, if the user is willing to sacrifice some performance under
... ...
performance_clusters.tex
... ... @@ -13,6 +13,7 @@ stable regions and vertical dashed lines mark the transitions made by
13 13 \begin{figure*}[t]
14 14 \begin{subfigure}[t]{\textwidth}
15 15 \centering
  16 + \vspace{-1em}
16 17 \includegraphics[width=\columnwidth]{{./figures/plots/496/stable_line_plots/stable_lineplot}.pdf}
17 18 \end{subfigure}%
18 19 \vspace{0.5em}
... ... @@ -242,8 +243,8 @@ in highest number of transitions. A common observation is that the number of tra
242 243 increase in cluster threshold. For \textit{bzip2}, an increase in inefficiency from
243 244 1.0 to 1.3 increases the number
244 245 of transitions needed to track the optimal settings. The number of
245   -available settings increase with inefficiency, giving more choices for the
246   -optimal frequency settings. At an inefficiency budget of 1.6, the average length of a stable region
  246 +available settings increases with inefficiency, increasing the average length of
  247 +stable regions. At an inefficiency budget of 1.6, the average length of a stable region
247 248 increases drastically as shown in Figure~\ref{box-lengths}(b), which requires far fewer
248 249 transitions with a 1\% cluster threshold and no transitions with higher cluster thresholds of 3\%
249 250 and 5\%. Note that there is only one point on the box plot for 3\% and 5\%
... ... @@ -294,13 +295,12 @@ at optimal settings. Figures~\ref{energy-perf-trade-off}(a) and
294 295 selecting the settings that don't degrade performance more than specified
295 296 cluster threshold. The figure also shows that with an increase in cluster
296 297 threshold, energy consumption decreases because lower frequency settings can be
297   -chosen at higher cluster thresholds. Figure~\ref{energy-perf-trade-off}(b) shows that
298   -performance (and energy) may improve when tuning overhead is included due to decrease in
299   -frequency transitions. We assume tuning overhead
300   -of 500us and 30uJ, which includes computing inefficiencies, searching for the
301   -optimal setting and transition the hardware to new
302   -settings~\cite{deng2012coscale}. We assumed that a space of 100 settings is
303   -searched for every transition.
  298 +chosen at higher cluster thresholds. Figure~\ref{energy-perf-trade-off}(b) shows
  299 +that performance (and energy) may improve when tuning overhead is included, due
  300 +to the decrease in frequency transitions. To determine tuning overhead, we wrote a
  301 +simple algorithm to find the optimal settings. With a search space of 70 frequency
  302 +settings, it resulted in a tuning overhead of 500us and 30uJ, which includes computing
  303 +inefficiencies, searching for the optimal setting, and transitioning the hardware to the new settings.
304 304 %This is not intuitive and we are investigating the cause of this anomaly
305 305 %\XXXnote{MH: be careful I would cut this s%entance at a minimum and then find
306 306 %the reason for the change}.
... ...
system_methodology.tex
... ... @@ -64,6 +64,7 @@ implementation are shaded in Figure~\ref{fig-system-block-diag}.
64 64 \includegraphics[width=\columnwidth,height=0.15\paperheight]{./figures/plots/496/speedup_inefficiency/heatmap_speedup.pdf}
65 65 \label{heatmap-speedup}
66 66 \end{subfigure}%
  67 +\vspace{-0.5em}
67 68 \caption{\textbf{Inefficiency vs. Speedup For Multiple Applications:} In
68 69 general, performance improves with increasing inefficiency budgets. A poorly
69 70 designed algorithm may select bad frequency settings which could waste energy
... ... @@ -157,17 +158,21 @@ performance and energy data to study the impact of workload dynamics on the
157 158 stability of CPU and memory frequency settings delivering best performance under
158 159 a given inefficiency budget. Note that all our studies are performed using
159 160 \textit{measured} performance and power data from the simulations; we do not \textit{predict}
160   -performance or energy.
161   -
162   -Although individual energy-performance trade-offs of DVFS for CPU and
163   -DFS for memory have been studied in the past, the trade-off resulting from
164   -the cross-component interaction of these two components has not been
165   -characterized. CoScale~\cite{deng2012coscale} did point out that
166   -interplay of performance and energy consumption of these two
167   -components is complex and did present a heuristic that attempts to
168   -pick the optimal point. In the next Section, we measure and characterize
  161 +performance or energy. The interplay of performance and energy consumption of
  162 +CPU and memory frequency scaling is complex, as pointed out by
  163 +CoScale~\cite{deng2012coscale}. In the next section, we measure and characterize
169 164 the larger space of all system level performance and energy trade-offs
170 165 of various CPU and memory frequency settings.
  166 +
  167 +%Although individual energy-performance trade-offs of DVFS for CPU and
  168 +%DFS for memory have been studied in the past, the trade-off resulting from
  169 +%the cross-component interaction of these two components has not been
  170 +%characterized. CoScale~\cite{deng2012coscale} did point out that
  171 +%interplay of performance and energy consumption of these two
  172 +%components is complex and did present a heuristic that attempts to
  173 +%pick the optimal point. In the next Section, we measure and characterize
  174 +%the larger space of all system level performance and energy trade-offs
  175 +%of various CPU and memory frequency settings.
171 176 %In the next section, we study how performance and
172 177 %inefficiency of applications varies with CPU and memory frequencies.
173 178  
... ...