Commit 31e336b9a5e630ce73377190ccaf33171ea8b5da ("draft.")
1 parent: 5611d5db
Showing 20 changed files with 67 additions and 58 deletions.
acknowledgement.tex
| 1 | 1 | \section{Acknowledgement} |
| 2 | -This material is based on work partially supported by NSF Collaborative Awards | |
| 2 | +This material is based on work partially supported by NSF Awards | |
| 3 | 3 | CSR-1409014 and CSR-1409367. Any opinions, findings, and conclusions or |
| 4 | 4 | recommendations expressed in this material are those of the authors and do not |
| 5 | 5 | necessarily reflect the views of the National Science Foundation. | ... | ... |
Changed figure files (binary PDFs, no preview available):
figures/plots/496/2d_best_point_variation_mulineff/gobmk_2d_stable_point_mulineff.pdf
figures/plots/496/2d_best_point_variation_mulineff/gobmk_2d_stable_point_mulineff_cpi_mpki.pdf
figures/plots/496/energy_perf_bar/energy_bar_normalized_0.0_0_0.pdf
figures/plots/496/energy_perf_bar/energy_bar_normalized_1.0_0_0.pdf
figures/plots/496/energy_perf_bar/energy_bar_normalized_5.0_0_0.pdf
figures/plots/496/energy_perf_bar/energy_perf_bar_1.3.pdf
figures/plots/496/energy_perf_bar/performance_bar_normalized_0.0_0_0.pdf
figures/plots/496/energy_perf_bar/performance_bar_normalized_1.0_0_0.pdf
figures/plots/496/energy_perf_bar/performance_bar_normalized_5.0_0_0.pdf
figures/plots/496/speedup_inefficiency/heatmap_inefficiency.pdf
figures/plots/496/stable_length_box/stable_length_box.pdf
figures/plots/496/stable_line_plots/lbm_stable_lineplot_annotated_5.pdf
figures/plots/496/stable_line_plots/stable_lineplot.pdf
inefficiency.tex
| ... | ... | @@ -121,11 +121,11 @@ both the energy ($E$) consumed by the application and the minimum energy |
| 121 | 121 | % |
| 122 | 122 | Computing $E$ is |
| 123 | 123 | straightforward; the Intel Sandy Bridge architecture~\cite{sandy-bridge-sw-manual} already |
| 124 | -provides performance counters capable of measuring energy consumption at | |
| 124 | +provides counters capable of measuring energy consumption at | |
| 125 | 125 | runtime and the research community has |
| 126 | 126 | tools and models to estimate the absolute energy of applications~\cite{brooks2000wattch,drampower-tool,li2009mcpat,micronpowercalc-lpddr3-url,wilton1996cacti}. |
| 127 | 127 | |
| 128 | -Computing $E_{min}$ is challenging because of the inter-component | |
| 128 | +Computing $E_{min}$ is challenging due to inter-component | |
| 129 | 129 | dependencies. |
| 130 | 130 | % |
| 131 | 131 | We propose two methods for computing $E_{min}$: |
| ... | ... | @@ -151,36 +151,37 @@ We propose two methods for computing $E_{min}$: |
| 151 | 151 | |
| 152 | 152 | \end{itemize} |
| 153 | 153 | |
| 154 | -%We are working towards designing efficient energy prediction models for CPU, | |
| 155 | -%memory and network components. | |
| 154 | +We are working towards designing efficient energy prediction models for CPU and | |
| 155 | +memory. | |
| 156 | 156 | % |
| 157 | -%Our models consider cross-component interactions on performance and energy | |
| 158 | -%consumption. | |
| 157 | +Our models account for cross-component interactions affecting performance and | |
| 158 | +energy consumption. | |
| 159 | 159 | % |
| 160 | 160 | %%%%%%%% MODEL %%%%%%%%%% |
| 161 | -We designed efficient models to predict performance and energy consumption of | |
| 162 | -CPU and memory at various voltage and frequency settings for a given | |
| 163 | -application. We plan on using these models to estimate $E_{min}$ of a given set | |
| 164 | -of instructions. | |
| 165 | -%We envision a system capable of scaling voltage and frequency of CPU and only | |
| 166 | -%frequency of DRAM. | |
| 167 | -Our models consider cross-component interactions on performance and energy. | |
| 168 | -The performance model uses hardware performance counters to measure amount of time | |
| 169 | -each component is $Busy$ completing the work, $Idle$ stalled on the other | |
| 170 | -component and $Waiting$ for more work. We designed systematic methodology to | |
| 171 | -scale these states to estimate execution time of a given workload at different | |
| 172 | -voltage and frequency settings. In our model, the $Idle$ time of one component | |
| 173 | -depends on the settings of the second component. The $Busy$ time of each | |
| 174 | -component scales with it's own frequency. However, part of the $Busy$ time that | |
| 175 | -overlaps with the other component is constrained by the slowest component. | |
| 176 | - | |
| 177 | -We combine predicted performance with the power models of CPU and memory | |
| 178 | -described in Section~\ref{subsec-energy-models} to estimate energy consumption | |
| 179 | -of CPU and memory. Our model has average prediction error of 4\% across SPEC | |
| 180 | -CPU2006 benchmarks with highest error of 10\% except for $gobmk (18\%)$ and $lbm | |
| 181 | -(24\%)$. In this work we demonstrate how to use inefficiency, deferring | |
| 182 | -optimization of $E_{min}$ prediction to future work. | |
| 161 | +%We designed efficient models to predict performance and energy consumption of | |
| 162 | +%CPU and memory at various voltage and frequency settings for a given | |
| 163 | +%application. We plan on using these models to estimate $E_{min}$ of a given set | |
| 164 | +%of instructions. | |
| 165 | +%%We envision a system capable of scaling voltage and frequency of CPU and only | |
| 166 | +%%frequency of DRAM. | |
| 167 | +%Our models consider cross-component interactions on performance and energy. | |
| 168 | +%The performance model uses hardware performance counters to measure amount of time | |
| 169 | +%each component is $Busy$ completing the work, $Idle$ stalled on the other | |
| 170 | +%component and $Waiting$ for more work. We designed systematic methodology to | |
| 171 | +%scale these states to estimate execution time of a given workload at different | |
| 172 | +%voltage and frequency settings. In our model, the $Idle$ time of one component | |
| 173 | +%depends on the settings of the second component. The $Busy$ time of each | |
| 174 | +%component scales with it's own frequency. However, part of the $Busy$ time that | |
| 175 | +%overlaps with the other component is constrained by the slowest component. | |
| 176 | +% | |
| 177 | +%We combine predicted performance with the power models of CPU and memory | |
| 178 | +%described in Section~\ref{subsec-energy-models} to estimate energy consumption | |
| 179 | +%of CPU and memory. Our model has average prediction error of 4\% across SPEC | |
| 180 | +%CPU2006 benchmarks with highest error of 10\% except for $gobmk (18\%)$ and $lbm | |
| 181 | +%(24\%)$. | |
| 183 | 182 | %%%%% END OF MODEL %%%%%% |
| 183 | +In this work we demonstrate how to use inefficiency, deferring the prediction | |
| 184 | +and optimization of $E_{min}$ to future work. | |
| 184 | 185 | |
| 185 | 186 | \subsection{Managing Inefficiency} |
| 186 | 187 | % | ... | ... |
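The hunks above reason about $E$ and $E_{min}$ without restating the inefficiency metric itself. As a quick reference, a minimal LaTeX sketch of how the two quantities relate, assuming inefficiency is defined as the ratio of consumed energy to minimum energy and that $E_{min}$ is taken over the CPU/memory frequency settings explored in this work (the exact definition appears earlier in inefficiency.tex, outside this diff):

    % Sketch only: assumes inefficiency = E / E_min over the explored settings.
    \begin{equation}
      \mathit{Inefficiency} \;=\; \frac{E}{E_{min}} \;\ge\; 1,
      \qquad
      E_{min} \;=\; \min_{(f_{cpu},\, f_{mem})} E(f_{cpu}, f_{mem})
    \end{equation}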
inefficiency_speedup.tex
| ... | ... | @@ -17,8 +17,8 @@ frequency settings may burn extra energy without improving performance. |
| 17 | 17 | |
| 18 | 18 | We performed offline analysis of the data collected from our simulations to |
| 19 | 19 | study the inefficiency-performance trends for various benchmarks. With a brute |
| 20 | -force search, we found $E_{min}$ and computed inefficiency at all frequency | |
| 21 | -settings. We express performance in terms of $speedup$, the ratio of execution | |
| 20 | +force search, we found $E_{min}$ and computed inefficiency at all %frequency | |
| 21 | +settings. We express performance in terms of $speedup$, the ratio of the longest | |
| 22 | 22 | execution time to the execution time for a given configuration. |
| 23 | 23 | % to the execution time at |
| 24 | 24 | %a given frequency setting. |
| ... | ... | @@ -66,7 +66,7 @@ example, \textit{gobmk} runs 1.5x slower if it is forced to run at budget of |
| 66 | 66 | the inefficiency constraint and \textbf{not} just \textbf{at} the inefficiency |
| 67 | 67 | constraint.} Algorithms forcing the system to run exactly at given budget might end |
| 68 | 68 | up wasting energy or, even worse, degrading performance. A smart algorithm should |
| 69 | -a) use no more than given inefficiency budget b) should use only as much | |
| 69 | +a) stay under the given inefficiency budget, b) use only as much | |
| 70 | 70 | inefficiency budget as needed, and c) deliver the best performance. |
| 71 | 71 | %\end{enumerate} |
| 72 | 72 | ... | ... |
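The changed paragraph above describes the offline brute-force pass: find $E_{min}$ across all settings, then compute inefficiency and speedup per setting. A minimal Python sketch of that pass, under the assumption that per-setting energy and execution time are available from the simulation output; the record layout and field names here are hypothetical, not the authors' tooling:

    # Sketch of the offline analysis in inefficiency_speedup.tex (assumed data layout).
    # `samples`: one dict per (cpu_freq, mem_freq) setting with measured
    # energy (J) and execution time (s) from simulation.
    def analyze(samples):
        e_min = min(s["energy"] for s in samples)   # brute-force E_min
        t_max = max(s["time"] for s in samples)     # longest execution time
        return [{
            "cpu_freq": s["cpu_freq"],
            "mem_freq": s["mem_freq"],
            "inefficiency": s["energy"] / e_min,    # >= 1 by construction
            "speedup": t_max / s["time"],           # relative to the slowest run
        } for s in samples]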
introduction.tex
| ... | ... | @@ -19,7 +19,8 @@ Still other hardware energy-performance tradeoffs are on the horizon, arising |
| 19 | 19 | from capabilities such as memory frequency scaling~\cite{david2011memory} and nanosecond-speed DVFS |
| 20 | 20 | emerging in next-generation hardware designs~\cite{6084810}. |
| 21 | 21 | |
| 22 | -We envision a next-generation smartphone capable of both CPU and memory DVFS. | |
| 22 | +We envision a next-generation smartphone capable of scaling both the voltage and | |
| 23 | +frequency of the CPU, and only the frequency of memory. | |
| 23 | 24 | % |
| 24 | 25 | While the addition of memory DVFS can be used to improve energy-constrained |
| 25 | 26 | performance, the larger frequency state space compared to CPU DVFS alone also | ... | ... |
optimal_performance.tex
| ... | ... | @@ -7,8 +7,8 @@ |
| 7 | 7 | \vspace{-0.5em} |
| 8 | 8 | \caption{\textbf{Optimal Performance Point for \text{Gobmk} Across Inefficiencies:} At |
| 9 | 9 | low inefficiency budgets, the optimal frequency settings follow CPI of the |
| 10 | -application, and select high memory frequencies for memory intensive phases with | |
| 11 | -high CPI. | |
| 10 | +application, and select high memory frequencies for memory intensive phases. % with | |
| 11 | +%high CPI. | |
| 12 | 12 | %to deliver best |
| 13 | 13 | %performance under given inefficiency constraint. |
| 14 | 14 | Higher inefficiency budgets |
| ... | ... | @@ -62,7 +62,7 @@ and then memory frequency as this setting is bound to have highest performance a |
| 62 | 62 | the other possibilities. |
| 63 | 63 | |
| 64 | 64 | Figure~\ref{gobmk-optimal} plots the optimal settings for $gobmk$ for all |
| 65 | -benchmark samples (each of length 10 million instructions) across multiple | |
| 65 | +benchmark samples (each of length 10~million instructions) across multiple | |
| 66 | 66 | inefficiency constraints. At low inefficiencies, the optimal settings follow |
| 67 | 67 | the trends in CPI (cycles per instruction) and MPKI (misses per thousand |
| 68 | 68 | instructions). Regions of higher CPI correspond to memory intensive phases, as |
| ... | ... | @@ -71,7 +71,7 @@ the SPEC benchmarks don't have any IO or interrupt based portions. |
| 71 | 71 | %The higher the CPI is, the higher |
| 72 | 72 | %the memory frequency of the optimal settings is (sample 7) to serve high memory |
| 73 | 73 | %traffic. |
| 74 | -For phases that are CPU intensive with (lower CPI), the optimal settings have | |
| 74 | +For phases that are CPU intensive (lower CPI), the optimal settings have | |
| 75 | 75 | higher CPU frequency and lower memory frequency. % (sample 9 and 10). At low |
| 76 | 76 | At low inefficiency constraints, due to the limited energy budget, a careful |
| 77 | 77 | allocation of energy across components becomes critical to achieve optimal |
| ... | ... | @@ -92,8 +92,10 @@ There are two key problems associated with tracking the optimal settings: |
| 92 | 92 | \noindent \textit{It is expensive.} Running the tuning algorithm at the end of |
| 93 | 93 | every sample to track optimal settings comes at a cost: 1) searching and |
| 94 | 94 | discovering the optimal settings 2) real hardware has transition latency |
| 95 | -overhead for both the CPU and the memory frequency. | |
| 96 | -%transitions in . | |
| 95 | +overhead for both the CPU and the memory frequency. For example, while the search | |
| 96 | +algorithm presented by CoScale~\cite{deng2012coscale} takes 5us to find optimal | |
| 97 | +frequency settings, the time taken by PLLs to change voltage and frequency in | |
| 98 | +commercial processors is on the order of tens of microseconds. | |
| 97 | 99 | Reducing the frequency at which tuning algorithms need to re-tune is critical to |
| 98 | 100 | reduce the cost of tuning overhead on application performance. |
| 99 | 101 | |
| ... | ... | @@ -105,7 +107,7 @@ highest performance). For example, \textit{bzip2} is CPU bound and therefore |
| 105 | 107 | its performance at a memory frequency of 200MHz is within 3\% of its performance at a |
| 106 | 108 | memory frequency of 800MHz while the CPU is running at 1000MHz. By sacrificing that |
| 107 | 109 | 3\% of performance, the system could have consumed 1/4 the memory background |
| 108 | -energy staying well under the given inefficiency budget. | |
| 110 | +energy, saving 2.7\% of the system energy and staying well under the given inefficiency budget. | |
| 109 | 111 | %\end{enumerate} |
| 110 | 112 | |
| 111 | 113 | We believe that, if the user is willing to sacrifice some performance under | ... | ... |
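The new text above quantifies the tuning cost (a 5us CoScale-style search versus tens-of-microseconds PLL relocking) and the performance that can be traded away under a budget. For readers of the diff, a minimal Python sketch of the underlying selection step, picking the best-performing setting whose inefficiency fits the budget; this illustrates the constraint only and is not the authors' search algorithm (it reuses the hypothetical records from the sketch after inefficiency_speedup.tex):

    # Sketch: choose the optimal setting for one sample under an inefficiency budget.
    def optimal_setting(results, budget):
        feasible = [r for r in results if r["inefficiency"] <= budget]
        if not feasible:
            return None                 # nothing fits under this budget
        # Best performance among feasible settings; a real tuner would also
        # weigh search time and CPU/memory frequency-transition latency.
        return max(feasible, key=lambda r: r["speedup"])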
performance_clusters.tex
| ... | ... | @@ -13,6 +13,7 @@ stable regions and vertical dashed lines mark the transitions made by |
| 13 | 13 | \begin{figure*}[t] |
| 14 | 14 | \begin{subfigure}[t]{\textwidth} |
| 15 | 15 | \centering |
| 16 | + \vspace{-1em} | |
| 16 | 17 | \includegraphics[width=\columnwidth]{{./figures/plots/496/stable_line_plots/stable_lineplot}.pdf} |
| 17 | 18 | \end{subfigure}% |
| 18 | 19 | \vspace{0.5em} |
| ... | ... | @@ -242,8 +243,8 @@ in highest number of transitions. A common observation is that the number of tra |
| 242 | 243 | increase in cluster threshold. For \textit{bzip2}, an increase in inefficiency from |
| 243 | 244 | 1.0 to 1.3 increases the number |
| 244 | 245 | of transitions needed to track the optimal settings. The number of |
| 245 | -available settings increase with inefficiency, giving more choices for the | |
| 246 | -optimal frequency settings. At an inefficiency budget of 1.6, the average length of a stable region | |
| 246 | +available settings increases with inefficiency, increasing the average length of | |
| 247 | +stable regions. At an inefficiency budget of 1.6, the average length of a stable region | |
| 247 | 248 | increases drastically as shown in Figure~\ref{box-lengths}(b), which requires far fewer |
| 248 | 249 | transitions with a 1\% cluster threshold and no transitions with higher cluster thresholds of 3\% |
| 249 | 250 | and 5\%. Note that there is only one point on the box plot for 3\% and 5\% |
| ... | ... | @@ -294,13 +295,12 @@ at optimal settings. Figures~\ref{energy-perf-trade-off}(a) and |
| 294 | 295 | selecting the settings that don't degrade performance by more than the specified |
| 295 | 296 | cluster threshold. The figure also shows that with an increase in cluster |
| 296 | 297 | threshold, energy consumption decreases because lower frequency settings can be |
| 297 | -chosen at higher cluster thresholds. Figure~\ref{energy-perf-trade-off}(b) shows that | |
| 298 | -performance (and energy) may improve when tuning overhead is included due to decrease in | |
| 299 | -frequency transitions. We assume tuning overhead | |
| 300 | -of 500us and 30uJ, which includes computing inefficiencies, searching for the | |
| 301 | -optimal setting and transition the hardware to new | |
| 302 | -settings~\cite{deng2012coscale}. We assumed that a space of 100 settings is | |
| 303 | -searched for every transition. | |
| 298 | +chosen at higher cluster thresholds. Figure~\ref{energy-perf-trade-off}(b) shows | |
| 299 | +that performance (and energy) may improve when tuning overhead is included, due | |
| 300 | +to the decrease in frequency transitions. To determine tuning overhead, we wrote a | |
| 301 | +simple algorithm to find optimal settings. With a search space of 70 frequency | |
| 302 | +settings, it resulted in a tuning overhead of 500us and 30uJ, which includes computing inefficiencies, searching for the optimal setting, | |
| 303 | +and transitioning the hardware to new settings. | |
| 304 | 304 | %This is not intuitive and we are investigating the cause of this anomaly |
| 305 | 305 | %\XXXnote{MH: be careful I would cut this s%entance at a minimum and then find |
| 306 | 306 | %the reason for the change}. | ... | ... |
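The hunks above lean on the cluster-threshold idea: for each sample, every setting whose performance is within the threshold (1%, 3%, or 5%) of the optimum is treated as acceptable, and a frequency transition happens only when the currently chosen setting falls out of that set. A minimal Python sketch of that bookkeeping, under the same hypothetical data layout as the earlier sketches; the paper's exact clustering rule may differ:

    # Sketch: count transitions when near-optimal settings are clustered.
    # `per_sample`: one dict per sample mapping (cpu_freq, mem_freq) -> speedup,
    # already restricted to settings within the inefficiency budget.
    def count_transitions(per_sample, cluster_threshold):
        transitions = 0
        current = None
        for speedups in per_sample:
            best = max(speedups.values())
            # Settings within `cluster_threshold` (e.g. 0.01, 0.03, 0.05) of optimal.
            cluster = {s for s, v in speedups.items()
                       if v >= (1 - cluster_threshold) * best}
            if current not in cluster:  # current setting no longer near-optimal
                current = max(cluster, key=speedups.get)
                transitions += 1        # the initial selection counts as one
        return transitions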
system_methodology.tex
| ... | ... | @@ -64,6 +64,7 @@ implementation are shaded in Figure~\ref{fig-system-block-diag}. |
| 64 | 64 | \includegraphics[width=\columnwidth,height=0.15\paperheight]{./figures/plots/496/speedup_inefficiency/heatmap_speedup.pdf} |
| 65 | 65 | \label{heatmap-speedup} |
| 66 | 66 | \end{subfigure}% |
| 67 | +\vspace{-0.5em} | |
| 67 | 68 | \caption{\textbf{Inefficiency vs. Speedup For Multiple Applications:} In |
| 68 | 69 | general, performance improves with increasing inefficiency budgets. A poorly |
| 69 | 70 | designed algorithm may select bad frequency settings which could waste energy |
| ... | ... | @@ -157,17 +158,21 @@ performance and energy data to study the impact of workload dynamics on the |
| 157 | 158 | stability of CPU and memory frequency settings delivering best performance under |
| 158 | 159 | a given inefficiency budget. Note that all our studies are performed using |
| 159 | 160 | \textit{measured} performance and power data from the simulations; we do not \textit{predict} |
| 160 | -performance or energy. | |
| 161 | - | |
| 162 | -Although individual energy-performance trade-offs of DVFS for CPU and | |
| 163 | -DFS for memory have been studied in the past, the trade-off resulting from | |
| 164 | -the cross-component interaction of these two components has not been | |
| 165 | -characterized. CoScale~\cite{deng2012coscale} did point out that | |
| 166 | -interplay of performance and energy consumption of these two | |
| 167 | -components is complex and did present a heuristic that attempts to | |
| 168 | -pick the optimal point. In the next Section, we measure and characterize | |
| 161 | +performance or energy. The interplay of performance and energy consumption of | |
| 162 | +CPU and memory frequency scaling is complex, as pointed out by | |
| 163 | +CoScale~\cite{deng2012coscale}. In the next section, we measure and characterize | |
| 169 | 164 | the larger space of all system level performance and energy trade-offs |
| 170 | 165 | of various CPU and memory frequency settings. |
| 166 | + | |
| 167 | +%Although individual energy-performance trade-offs of DVFS for CPU and | |
| 168 | +%DFS for memory have been studied in the past, the trade-off resulting from | |
| 169 | +%the cross-component interaction of these two components has not been | |
| 170 | +%characterized. CoScale~\cite{deng2012coscale} did point out that | |
| 171 | +%interplay of performance and energy consumption of these two | |
| 172 | +%components is complex and did present a heuristic that attempts to | |
| 173 | +%pick the optimal point. In the next Section, we measure and characterize | |
| 174 | +%the larger space of all system level performance and energy trade-offs | |
| 175 | +%of various CPU and memory frequency settings. | |
| 171 | 176 | %In the next section, we study how performance and |
| 172 | 177 | %inefficiency of applications varies with CPU and memory frequencies. |
| 173 | 178 | ... | ... |