Commit 31e336b9a5e630ce73377190ccaf33171ea8b5da

Authored by Rizwana Begum
1 parent 5611d5db

draft.

acknowledgement.tex
@@ -1,5 +1,5 @@
 \section{Acknowledgement}
-This material is based on work partially supported by NSF Collaborative Awards
+This material is based on work partially supported by NSF Awards
 CSR-1409014 and CSR-1409367. Any opinion, findings, and conclusions or
 recommendations expressed in this material are those of the authors and do not
 necessarily reflect the views of the National Science Foundation.
Changed binary files (PDF figures, no preview available):
figures/plots/496/2d_best_point_variation_mulineff/gobmk_2d_stable_point_mulineff.pdf
figures/plots/496/2d_best_point_variation_mulineff/gobmk_2d_stable_point_mulineff_cpi_mpki.pdf
figures/plots/496/energy_perf_bar/energy_bar_normalized_0.0_0_0.pdf
figures/plots/496/energy_perf_bar/energy_bar_normalized_1.0_0_0.pdf
figures/plots/496/energy_perf_bar/energy_bar_normalized_5.0_0_0.pdf
figures/plots/496/energy_perf_bar/energy_perf_bar_1.3.pdf
figures/plots/496/energy_perf_bar/performance_bar_normalized_0.0_0_0.pdf
figures/plots/496/energy_perf_bar/performance_bar_normalized_1.0_0_0.pdf
figures/plots/496/energy_perf_bar/performance_bar_normalized_5.0_0_0.pdf
figures/plots/496/speedup_inefficiency/heatmap_inefficiency.pdf
figures/plots/496/stable_length_box/stable_length_box.pdf
figures/plots/496/stable_line_plots/lbm_stable_lineplot_annotated_5.pdf
figures/plots/496/stable_line_plots/stable_lineplot.pdf
inefficiency.tex
@@ -121,11 +121,11 @@ both the energy ($E$) consumed by the application and the minimum energy
 %
 Computing $E$ is straight
 forward; Intel Sandy bridge architecture ~\cite{sandy-bridge-sw-manual} already
-provides performance counters capable of measuring energy consumption at
+provides counters capable of measuring energy consumption at
 runtime and the research community has
 tools and models to estimate the absolute energy of applications~\cite{brooks2000wattch,drampower-tool,li2009mcpat,micronpowercalc-lpddr3-url,wilton1996cacti}.
 
-Computing $E_{min}$ is challenging because of the inter-component
+Computing $E_{min}$ is challenging due to inter-component
 dependencies.
 %
 We propose two methods for computing $E_{min}$:
@@ -151,36 +151,37 @@ We propose two methods for computing $E_{min}$:
 
 \end{itemize}
 
-%We are working towards designing efficient energy prediction models for CPU,
-%memory and network components.
+We are working towards designing efficient energy prediction models for CPU and
+memory.
 %
-%Our models consider cross-component interactions on performance and energy
-%consumption.
+Our models consider cross-component interactions on performance and energy
+consumption.
 %
 %%%%%%%% MODEL %%%%%%%%%%
-We designed efficient models to predict performance and energy consumption of
-CPU and memory at various voltage and frequency settings for a given
-application. We plan on using these models to estimate $E_{min}$ of a given set
-of instructions.
-%We envision a system capable of scaling voltage and frequency of CPU and only
-%frequency of DRAM.
-Our models consider cross-component interactions on performance and energy.
-The performance model uses hardware performance counters to measure amount of time
-each component is $Busy$ completing the work, $Idle$ stalled on the other
-component and $Waiting$ for more work. We designed systematic methodology to
-scale these states to estimate execution time of a given workload at different
-voltage and frequency settings. In our model, the $Idle$ time of one component
-depends on the settings of the second component. The $Busy$ time of each
-component scales with it's own frequency. However, part of the $Busy$ time that
-overlaps with the other component is constrained by the slowest component.
-
-We combine predicted performance with the power models of CPU and memory
-described in Section~\ref{subsec-energy-models} to estimate energy consumption
-of CPU and memory. Our model has average prediction error of 4\% across SPEC
-CPU2006 benchmarks with highest error of 10\% except for $gobmk (18\%)$ and $lbm
-(24\%)$. In this work we demonstrate how to use inefficiency, deferring
-optimization of $E_{min}$ prediction to future work.
+%We designed efficient models to predict performance and energy consumption of
+%CPU and memory at various voltage and frequency settings for a given
+%application. We plan on using these models to estimate $E_{min}$ of a given set
+%of instructions.
+%%We envision a system capable of scaling voltage and frequency of CPU and only
+%%frequency of DRAM.
+%Our models consider cross-component interactions on performance and energy.
+%The performance model uses hardware performance counters to measure amount of time
+%each component is $Busy$ completing the work, $Idle$ stalled on the other
+%component and $Waiting$ for more work. We designed systematic methodology to
+%scale these states to estimate execution time of a given workload at different
+%voltage and frequency settings. In our model, the $Idle$ time of one component
+%depends on the settings of the second component. The $Busy$ time of each
+%component scales with it's own frequency. However, part of the $Busy$ time that
+%overlaps with the other component is constrained by the slowest component.
+%
+%We combine predicted performance with the power models of CPU and memory
+%described in Section~\ref{subsec-energy-models} to estimate energy consumption
+%of CPU and memory. Our model has average prediction error of 4\% across SPEC
+%CPU2006 benchmarks with highest error of 10\% except for $gobmk (18\%)$ and $lbm
+%(24\%)$.
 %%%%% END OF MODEL %%%%%%
+In this work we demonstrate how to use inefficiency, deferring prediction and
+optimization of $E_{min}$ to future work.
 
 \subsection{Managing Inefficiency}
 %
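The hunk above defines inefficiency from the measured energy $E$ and the minimum energy $E_{min}$ over all frequency settings. A minimal sketch of that computation, where the function name, the settings, and the energy numbers are hypothetical and only illustrative:

```python
# Hypothetical sketch: inefficiency of a frequency setting is the energy it
# consumed divided by the minimum energy over all settings (E / E_min).
def inefficiency(energy_by_setting):
    """Map each (cpu_mhz, mem_mhz) setting to its E / E_min ratio."""
    e_min = min(energy_by_setting.values())
    return {s: e / e_min for s, e in energy_by_setting.items()}

# Toy measured energies (joules) for three settings -- not paper data.
measured = {(1000, 800): 1.30, (800, 400): 1.00, (600, 200): 1.12}
ineff = inefficiency(measured)
# The most energy-efficient setting has inefficiency exactly 1.0 by definition.
assert min(ineff.values()) == 1.0
```

By construction the metric is dimensionless and bounded below by 1.0, which is what makes it usable as a budget.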
inefficiency_speedup.tex
@@ -17,8 +17,8 @@ frequency settings may burn extra energy without improving performance.
 
 We performed offline analysis of the data collected from our simulations to
 study the inefficiency-performance trends for various benchmarks. With a brute
-force search, we found $E_{min}$ and computed inefficiency at all frequency
-settings. We express performance in terms of $speedup$, the ratio of execution
+force search, we found $E_{min}$ and computed inefficiency at all %frequency
+settings. We express performance in terms of $speedup$, the ratio of execution
 time for a given configuration to the longest execution time.
 % to the execution time at
 %a given frequency setting.
@@ -66,7 +66,7 @@ example, \textit{gobmk} runs 1.5x slower if it is forced to run at budget of
 the inefficiency constraint and \textbf{not} just \textbf{at} the inefficiency
 constraint.} Algorithms forcing the system to run exactly at given budget might end
 up wasting energy or, even worse, degrading performance. A smart algorithm should
-a) use no more than given inefficiency budget b) should use only as much
+a) stay under the given inefficiency budget, b) use only as much
 inefficiency budget as needed c) and deliver the best performance.
 %\end{enumerate}
 
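The smart-algorithm criteria edited above (stay under the budget, use only as much budget as needed, deliver the best performance) amount to the offline brute-force search the file describes. A hypothetical sketch follows; the data are made up, and speedup is taken here in the conventional orientation (longest execution time divided by the time at a setting, so higher is better):

```python
# Hypothetical sketch of the offline analysis: for each frequency setting we
# have a measured (time, energy) pair; compute inefficiency and speedup, then
# pick the fastest setting whose inefficiency stays under the budget.
def best_under_budget(samples, budget):
    """samples: {(cpu_mhz, mem_mhz): (time_s, energy_j)} -> best setting."""
    e_min = min(e for _, e in samples.values())
    t_max = max(t for t, _ in samples.values())
    feasible = {s: t_max / t                 # speedup vs. slowest configuration
                for s, (t, e) in samples.items()
                if e / e_min <= budget}      # a) stay under the budget
    return max(feasible, key=feasible.get)   # c) best feasible performance

# Toy (time, energy) measurements for three settings -- not paper data.
data = {(1000, 800): (1.0, 1.5), (800, 400): (1.4, 1.0), (600, 200): (2.0, 1.1)}
assert best_under_budget(data, 1.2) == (800, 400)   # tight budget
assert best_under_budget(data, 1.6) == (1000, 800)  # looser budget buys speed
```

Note criterion b) falls out for free: filtering by `e / e_min <= budget` never forces the system to spend the whole budget, it only caps it.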
introduction.tex
@@ -19,7 +19,8 @@ Still other hardware energy-performance tradeoffs are on the horizon, arising
 from capabilities such as memory frequency scaling~\cite{david2011memory} and nanosecond-speed DVFS
 emerging in next-generation hardware designs~\cite{6084810}.
 
-We envision a next-generation smartphone capable of both CPU and memory DVFS.
+We envision a next-generation smartphone capable of scaling both the voltage
+and frequency of the CPU and only the frequency of memory.
 %
 While the addition of memory DVFS can be used to improve energy-constrained
 performance, the larger frequency state space compared to CPU DVFS alone also
optimal_performance.tex
@@ -7,8 +7,8 @@
 \vspace{-0.5em}
 \caption{\textbf{Optimal Performance Point for \text{Gobmk} Across Inefficiencies:} At
 low inefficiency budgets, the optimal frequency settings follow CPI of the
-application, and select high memory frequencies for memory intensive phases with
-high CPI.
+application, and select high memory frequencies for memory intensive phases. % with
+%high CPI.
 %to deliver best
 %performance under given inefficiency constraint.
 Higher inefficiency budgets
@@ -62,7 +62,7 @@ and then memory frequency as this setting is bound to have highest performance a
 the other possibilities.
 
 Figure~\ref{gobmk-optimal} plots the optimal settings for $gobmk$ for all
-benchmark samples (each of length 10 million instructions) across multiple
+benchmark samples (each of length 10~M instructions) across multiple
 inefficiency constraints. At low inefficiencies, the optimal settings follow
 the trends in CPI (cycles per instruction) and MPKI (misses per thousand
 instructions). Regions of higher CPI correspond to memory intensive phases, as
@@ -71,7 +71,7 @@ the SPEC benchmarks don't have any IO or interrupt based portions.
 %The higher the CPI is, the higher
 %the memory frequency of the optimal settings is (sample 7) to serve high memory
 %traffic.
-For phases that are CPU intensive with (lower CPI), the optimal settings have
+For phases that are CPU intensive (lower CPI), the optimal settings have
 higher CPU frequency and lower memory frequency. % (sample 9 and 10). At low
 At low inefficiency constraints, due to the limited energy budget, a careful
 allocation of energy across components becomes critical to achieve optimal
@@ -92,8 +92,10 @@ There are two key problems associated with tracking the optimal settings:
 \noindent \textit{It is expensive.} Running the tuning algorithm at the end of
 every sample to track optimal settings comes at a cost: 1) searching and
 discovering the optimal settings 2) real hardware has transition latency
-overhead for both the CPU and the memory frequency.
-%transitions in .
+overhead for both the CPU and the memory frequency. For example, while the
+search algorithm presented by CoScale~\cite{deng2012coscale} takes 5us to find
+the optimal frequency settings, the time taken by PLLs to change voltage and
+frequency in commercial processors is on the order of tens of microseconds.
 Reducing the frequency at which tuning algorithms need to re-tune is critical to
 reduce the cost of tuning overhead on application performance.
 
@@ -105,7 +107,7 @@ highest performance). For example, \textit{bzip2} is CPU bound and therefore
 its performance at memory frequency of 200MHz is within 3\% of performance at a
 memory frequency of 800MHz while the CPU is running at 1000MHz. By sacrificing that
 3\% of performance, the system could have consumed 1/4 the memory background
-energy staying well under the given inefficiency budget.
+energy, saving 2.7\% of the system energy and staying well under the given inefficiency budget.
 %\end{enumerate}
 
 We believe that, if the user is willing to sacrifice some performance under
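The retuning-cost argument in the hunks above (a 5us CoScale-style search plus a PLL relock on the order of tens of microseconds, paid at every 10~M-instruction sample) can be checked with a back-of-the-envelope calculation. The sample length and the 5us search figure come from the text; the clock rate, CPI, and exact transition latency are assumptions for illustration:

```python
# Back-of-the-envelope sketch of per-sample retuning overhead.
INSNS_PER_SAMPLE = 10_000_000    # 10M-instruction samples, per the text
CPI = 1.0                        # assumed cycles per instruction
CPU_HZ = 1.0e9                   # assumed 1 GHz clock
SEARCH_S = 5e-6                  # CoScale-style search (~5 us, per the text)
TRANSITION_S = 20e-6             # assumed PLL relock, "tens of microseconds"

sample_s = INSNS_PER_SAMPLE * CPI / CPU_HZ       # ~10 ms per sample
overhead = (SEARCH_S + TRANSITION_S) / sample_s  # fraction lost to retuning
assert sample_s == 0.01
assert round(overhead, 4) == 0.0025  # ~0.25% if re-tuned at every sample
```

Even under these generous assumptions the cost is paid at every sample, which is why reducing how often the algorithm re-tunes matters.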
performance_clusters.tex
@@ -13,6 +13,7 @@ stable regions and vertical dashed lines mark the transitions made by
 \begin{figure*}[t]
 \begin{subfigure}[t]{\textwidth}
 \centering
+\vspace{-1em}
 \includegraphics[width=\columnwidth]{{./figures/plots/496/stable_line_plots/stable_lineplot}.pdf}
 \end{subfigure}%
 \vspace{0.5em}
@@ -242,8 +243,8 @@ in highest number of transitions. A common observation is that the number of tra
 increase in cluster threshold. For \textit{bzip2}, increase in inefficiency from
 1.0 to 1.3 increases the number
 of transitions needed to track the optimal settings. The number of
-available settings increase with inefficiency, giving more choices for the
-optimal frequency settings. At an inefficiency budget of 1.6, the average length of a stable region
+available settings increases with inefficiency, increasing the average length
+of stable regions. At an inefficiency budget of 1.6, the average length of a stable region
 increases drastically as shown in Figure~\ref{box-lengths}(b), which requires much less
 transitions with 1\% cluster threshold and no transitions with higher cluster thresholds of 3\%
 and 5\%. Note that there is only one point on the box plot for 3\% and 5\%
@@ -294,13 +295,12 @@ at optimal settings. Figures~\ref{energy-perf-trade-off}(a) and
 selecting the settings that don't degrade performance more than specified
 cluster threshold. The figure also shows that with an increase in cluster
 threshold, energy consumption decreases because lower frequency settings can be
-chosen at higher cluster thresholds. Figure~\ref{energy-perf-trade-off}(b) shows that
-performance (and energy) may improve when tuning overhead is included due to decrease in
-frequency transitions. We assume tuning overhead
-of 500us and 30uJ, which includes computing inefficiencies, searching for the
-optimal setting and transition the hardware to new
-settings~\cite{deng2012coscale}. We assumed that a space of 100 settings is
-searched for every transition.
+chosen at higher cluster thresholds. Figure~\ref{energy-perf-trade-off}(b) shows
+that performance (and energy) may improve when tuning overhead is included due
+to a decrease in frequency transitions. To determine tuning overhead, we wrote a
+simple algorithm to find optimal settings. With a search space of 70 frequency
+settings, it resulted in tuning overhead of 500us and 30uJ, which includes computing inefficiencies, searching for the optimal setting,
+and transitioning the hardware to new settings.
 %This is not intuitive and we are investigating the cause of this anomaly
 %\XXXnote{MH: be careful I would cut this s%entance at a minimum and then find
 %the reason for the change}.
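The cluster-threshold mechanism these hunks discuss (hold the current frequency setting while its performance stays within the threshold of the per-sample optimum, and transition only when it falls behind) can be sketched as follows. The setting names and per-sample speedup tables are hypothetical:

```python
# Hypothetical sketch of the cluster-threshold idea: re-tune only when the
# current setting's performance drops more than `threshold` below the optimum.
def count_transitions(perf, threshold):
    """perf: list of {setting: speedup} dicts, one per sample."""
    current, transitions = None, 0
    for sample in perf:
        best = max(sample.values())
        if current is None or sample[current] < best * (1 - threshold):
            current = max(sample, key=sample.get)  # transition to the optimum
            transitions += 1
    return transitions

# Toy per-sample speedups for two settings "A" and "B" -- not paper data.
samples = [{"A": 1.00, "B": 0.98}, {"A": 0.97, "B": 1.00}, {"A": 0.96, "B": 1.00}]
assert count_transitions(samples, 0.05) == 1  # 5% threshold: "A" is kept
assert count_transitions(samples, 0.01) == 2  # 1% threshold forces a switch
```

This mirrors the observed trend: a larger cluster threshold tolerates more drift from the optimum, so stable regions lengthen and transitions become rarer.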
system_methodology.tex
@@ -64,6 +64,7 @@ implementation are shaded in Figure~\ref{fig-system-block-diag}.
 \includegraphics[width=\columnwidth,height=0.15\paperheight]{./figures/plots/496/speedup_inefficiency/heatmap_speedup.pdf}
 \label{heatmap-speedup}
 \end{subfigure}%
+\vspace{-0.5em}
 \caption{\textbf{Inefficiency vs. Speedup For Multiple Applications:} In
 general, performance improves with increasing inefficiency budgets. A poorly
 designed algorithm may select bad frequency settings which could waste energy
@@ -157,17 +158,21 @@ performance and energy data to study the impact of workload dynamics on the
 stability of CPU and memory frequency settings delivering best performance under
 a given inefficiency budget. Note that, all our studies are performed using
 \textit{measured} performance and power data from the simulations, we do not \textit{predict}
-performance or energy.
-
-Although individual energy-performance trade-offs of DVFS for CPU and
-DFS for memory have been studied in the past, the trade-off resulting from
-the cross-component interaction of these two components has not been
-characterized. CoScale~\cite{deng2012coscale} did point out that
-interplay of performance and energy consumption of these two
-components is complex and did present a heuristic that attempts to
-pick the optimal point. In the next Section, we measure and characterize
+performance or energy. The interplay of performance and energy consumption of
+CPU and memory frequency scaling is complex, as pointed out by
+CoScale~\cite{deng2012coscale}. In the next Section, we measure and characterize
 the larger space of all system level performance and energy trade-offs
 of various CPU and memory frequency settings.
+
+%Although individual energy-performance trade-offs of DVFS for CPU and
+%DFS for memory have been studied in the past, the trade-off resulting from
+%the cross-component interaction of these two components has not been
+%characterized. CoScale~\cite{deng2012coscale} did point out that
+%interplay of performance and energy consumption of these two
+%components is complex and did present a heuristic that attempts to
+%pick the optimal point. In the next Section, we measure and characterize
+%the larger space of all system level performance and energy trade-offs
+%of various CPU and memory frequency settings.
 %In the next section, we study how performance and
 %inefficiency of applications varies with CPU and memory frequencies.
 