diff --git a/acknowledgement.tex b/acknowledgement.tex index 0baad77..011cc33 100644 --- a/acknowledgement.tex +++ b/acknowledgement.tex @@ -1,5 +1,5 @@ \section{Acknowledgement} -This material is based on work partially supported by NSF Collaborative Awards +This material is based on work partially supported by NSF Awards CSR-1409014 and CSR-1409367. Any opinion, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. diff --git a/figures/plots/496/2d_best_point_variation_mulineff/gobmk_2d_stable_point_mulineff.pdf b/figures/plots/496/2d_best_point_variation_mulineff/gobmk_2d_stable_point_mulineff.pdf index d302e00..d275eff 100644 --- a/figures/plots/496/2d_best_point_variation_mulineff/gobmk_2d_stable_point_mulineff.pdf +++ b/figures/plots/496/2d_best_point_variation_mulineff/gobmk_2d_stable_point_mulineff.pdf diff --git a/figures/plots/496/2d_best_point_variation_mulineff/gobmk_2d_stable_point_mulineff_cpi_mpki.pdf b/figures/plots/496/2d_best_point_variation_mulineff/gobmk_2d_stable_point_mulineff_cpi_mpki.pdf index 84a831c..b3d1abc 100644 --- a/figures/plots/496/2d_best_point_variation_mulineff/gobmk_2d_stable_point_mulineff_cpi_mpki.pdf +++ b/figures/plots/496/2d_best_point_variation_mulineff/gobmk_2d_stable_point_mulineff_cpi_mpki.pdf diff --git a/figures/plots/496/energy_perf_bar/energy_bar_normalized_0.0_0_0.pdf b/figures/plots/496/energy_perf_bar/energy_bar_normalized_0.0_0_0.pdf index 19dea3f..3c1993e 100644 --- a/figures/plots/496/energy_perf_bar/energy_bar_normalized_0.0_0_0.pdf +++ b/figures/plots/496/energy_perf_bar/energy_bar_normalized_0.0_0_0.pdf diff --git a/figures/plots/496/energy_perf_bar/energy_bar_normalized_1.0_0_0.pdf b/figures/plots/496/energy_perf_bar/energy_bar_normalized_1.0_0_0.pdf index f96e687..0190614 100644 --- a/figures/plots/496/energy_perf_bar/energy_bar_normalized_1.0_0_0.pdf +++ 
b/figures/plots/496/energy_perf_bar/energy_bar_normalized_1.0_0_0.pdf diff --git a/figures/plots/496/energy_perf_bar/energy_bar_normalized_5.0_0_0.pdf b/figures/plots/496/energy_perf_bar/energy_bar_normalized_5.0_0_0.pdf index 129b957..d5a6261 100644 --- a/figures/plots/496/energy_perf_bar/energy_bar_normalized_5.0_0_0.pdf +++ b/figures/plots/496/energy_perf_bar/energy_bar_normalized_5.0_0_0.pdf diff --git a/figures/plots/496/energy_perf_bar/energy_perf_bar_1.3.pdf b/figures/plots/496/energy_perf_bar/energy_perf_bar_1.3.pdf index 930438f..c617ed8 100644 --- a/figures/plots/496/energy_perf_bar/energy_perf_bar_1.3.pdf +++ b/figures/plots/496/energy_perf_bar/energy_perf_bar_1.3.pdf diff --git a/figures/plots/496/energy_perf_bar/performance_bar_normalized_0.0_0_0.pdf b/figures/plots/496/energy_perf_bar/performance_bar_normalized_0.0_0_0.pdf index bce5748..aba9253 100644 --- a/figures/plots/496/energy_perf_bar/performance_bar_normalized_0.0_0_0.pdf +++ b/figures/plots/496/energy_perf_bar/performance_bar_normalized_0.0_0_0.pdf diff --git a/figures/plots/496/energy_perf_bar/performance_bar_normalized_1.0_0_0.pdf b/figures/plots/496/energy_perf_bar/performance_bar_normalized_1.0_0_0.pdf index b588c6e..136fd3a 100644 --- a/figures/plots/496/energy_perf_bar/performance_bar_normalized_1.0_0_0.pdf +++ b/figures/plots/496/energy_perf_bar/performance_bar_normalized_1.0_0_0.pdf diff --git a/figures/plots/496/energy_perf_bar/performance_bar_normalized_5.0_0_0.pdf b/figures/plots/496/energy_perf_bar/performance_bar_normalized_5.0_0_0.pdf index e883d3f..5569b3d 100644 --- a/figures/plots/496/energy_perf_bar/performance_bar_normalized_5.0_0_0.pdf +++ b/figures/plots/496/energy_perf_bar/performance_bar_normalized_5.0_0_0.pdf diff --git a/figures/plots/496/speedup_inefficiency/heatmap_inefficiency.pdf b/figures/plots/496/speedup_inefficiency/heatmap_inefficiency.pdf index 282143c..7373a63 100644 --- a/figures/plots/496/speedup_inefficiency/heatmap_inefficiency.pdf +++ 
b/figures/plots/496/speedup_inefficiency/heatmap_inefficiency.pdf diff --git a/figures/plots/496/stable_length_box/stable_length_box.pdf b/figures/plots/496/stable_length_box/stable_length_box.pdf index e9a2769..44d436b 100644 --- a/figures/plots/496/stable_length_box/stable_length_box.pdf +++ b/figures/plots/496/stable_length_box/stable_length_box.pdf diff --git a/figures/plots/496/stable_line_plots/lbm_stable_lineplot_annotated_5.pdf b/figures/plots/496/stable_line_plots/lbm_stable_lineplot_annotated_5.pdf index 923f68d..84f0b25 100644 --- a/figures/plots/496/stable_line_plots/lbm_stable_lineplot_annotated_5.pdf +++ b/figures/plots/496/stable_line_plots/lbm_stable_lineplot_annotated_5.pdf diff --git a/figures/plots/496/stable_line_plots/stable_lineplot.pdf b/figures/plots/496/stable_line_plots/stable_lineplot.pdf index ead6969..9b45475 100644 --- a/figures/plots/496/stable_line_plots/stable_lineplot.pdf +++ b/figures/plots/496/stable_line_plots/stable_lineplot.pdf diff --git a/inefficiency.tex b/inefficiency.tex index 8476405..dd2c17a 100644 --- a/inefficiency.tex +++ b/inefficiency.tex @@ -121,11 +121,11 @@ both the energy ($E$) consumed by the application and the minimum energy % Computing $E$ is straightforward; the Intel Sandy Bridge architecture~\cite{sandy-bridge-sw-manual} already -provides performance counters capable of measuring energy consumption at +provides counters capable of measuring energy consumption at runtime, and the research community has tools and models to estimate the absolute energy of applications~\cite{brooks2000wattch,drampower-tool,li2009mcpat,micronpowercalc-lpddr3-url,wilton1996cacti}. -Computing $E_{min}$ is challenging because of the inter-component +Computing $E_{min}$ is challenging due to inter-component dependencies.
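The inefficiency metric introduced here ($E / E_{min}$) can be sketched in a few lines; the helper name `inefficiency` and the per-setting energy values below are illustrative assumptions, not the paper's tooling:

```python
# Inefficiency = E / E_min: measured energy over the minimum energy achievable
# across all frequency settings. All numbers below are made-up illustrative values.

def inefficiency(measured_energy_mj, energies_at_all_settings):
    """Returns a value >= 1.0; 1.0 means the run matched the E_min setting."""
    e_min = min(energies_at_all_settings)
    return measured_energy_mj / e_min

# Hypothetical per-sample energy (mJ) at a few (cpu_MHz, mem_MHz) settings.
energies = {(1000, 800): 24, (1000, 200): 20, (600, 200): 16}
print(inefficiency(24, energies.values()))  # 1.5: this run burned 1.5x the minimum
```

A run at the $E_{min}$ setting yields an inefficiency of exactly 1.0; every other setting scores higher.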
% We propose two methods for computing $E_{min}$: @@ -151,36 +151,37 @@ We propose two methods for computing $E_{min}$: \end{itemize} -%We are working towards designing efficient energy prediction models for CPU, -%memory and network components. +We are working towards designing efficient energy prediction models for CPU and +memory. % -%Our models consider cross-component interactions on performance and energy -%consumption. +Our models consider cross-component interactions on performance and energy +consumption. % %%%%%%%% MODEL %%%%%%%%%% -We designed efficient models to predict performance and energy consumption of -CPU and memory at various voltage and frequency settings for a given -application. We plan on using these models to estimate $E_{min}$ of a given set -of instructions. -%We envision a system capable of scaling voltage and frequency of CPU and only -%frequency of DRAM. -Our models consider cross-component interactions on performance and energy. -The performance model uses hardware performance counters to measure amount of time -each component is $Busy$ completing the work, $Idle$ stalled on the other -component and $Waiting$ for more work. We designed systematic methodology to -scale these states to estimate execution time of a given workload at different -voltage and frequency settings. In our model, the $Idle$ time of one component -depends on the settings of the second component. The $Busy$ time of each -component scales with it's own frequency. However, part of the $Busy$ time that -overlaps with the other component is constrained by the slowest component. - -We combine predicted performance with the power models of CPU and memory -described in Section~\ref{subsec-energy-models} to estimate energy consumption -of CPU and memory. Our model has average prediction error of 4\% across SPEC -CPU2006 benchmarks with highest error of 10\% except for $gobmk (18\%)$ and $lbm -(24\%)$. 
In this work we demonstrate how to use inefficiency, deferring -optimization of $E_{min}$ prediction to future work. +%We designed efficient models to predict performance and energy consumption of +%CPU and memory at various voltage and frequency settings for a given +%application. We plan on using these models to estimate $E_{min}$ of a given set +%of instructions. +%%We envision a system capable of scaling voltage and frequency of CPU and only +%%frequency of DRAM. +%Our models consider cross-component interactions on performance and energy. +%The performance model uses hardware performance counters to measure amount of time +%each component is $Busy$ completing the work, $Idle$ stalled on the other +%component and $Waiting$ for more work. We designed systematic methodology to +%scale these states to estimate execution time of a given workload at different +%voltage and frequency settings. In our model, the $Idle$ time of one component +%depends on the settings of the second component. The $Busy$ time of each +%component scales with it's own frequency. However, part of the $Busy$ time that +%overlaps with the other component is constrained by the slowest component. +% +%We combine predicted performance with the power models of CPU and memory +%described in Section~\ref{subsec-energy-models} to estimate energy consumption +%of CPU and memory. Our model has average prediction error of 4\% across SPEC +%CPU2006 benchmarks with highest error of 10\% except for $gobmk (18\%)$ and $lbm +%(24\%)$. %%%%% END OF MODEL %%%%%% +In this work we demonstrate how to use inefficiency, deferring the prediction +and optimization of $E_{min}$ to future work. \subsection{Managing Inefficiency} % diff --git a/inefficiency_speedup.tex b/inefficiency_speedup.tex index 5182d36..c895c30 100644 --- a/inefficiency_speedup.tex +++ b/inefficiency_speedup.tex @@ -17,8 +17,8 @@ frequency settings may burn extra energy without improving performance.
We performed offline analysis of the data collected from our simulations to study the inefficiency-performance trends for various benchmarks. With a brute -force search, we found $E_{min}$ and computed inefficiency at all frequency -settings. We express performance in terms of $speedup$, the ratio of execution +force search, we found $E_{min}$ and computed inefficiency at all +settings. We express performance in terms of $speedup$, the ratio of execution time for a given configuration to the longest execution time. % to the execution time at %a given frequency setting. @@ -66,7 +66,7 @@ example, \textit{gobmk} runs 1.5x slower if it is forced to run at budget of the inefficiency constraint and \textbf{not} just \textbf{at} the inefficiency constraint.} Algorithms forcing the system to run exactly at the given budget might end up wasting energy or, even worse, degrading performance. A smart algorithm should -a) use no more than given inefficiency budget b) should use only as much +a) stay under the given inefficiency budget, b) use only as much inefficiency budget as needed, and c) deliver the best performance. %\end{enumerate} diff --git a/introduction.tex b/introduction.tex index 1de7cb4..6805359 100644 --- a/introduction.tex +++ b/introduction.tex @@ -19,7 +19,8 @@ Still other hardware energy-performance tradeoffs are on the horizon, arising from capabilities such as memory frequency scaling~\cite{david2011memory} and nanosecond-speed DVFS emerging in next-generation hardware designs~\cite{6084810}. -We envision a next-generation smartphone capable of both CPU and memory DVFS. +We envision a next-generation smartphone capable of scaling both the voltage +and frequency of the CPU, but only the frequency of the memory.
% While the addition of memory DVFS can be used to improve energy-constrained performance, the larger frequency state space compared to CPU DVFS alone also diff --git a/optimal_performance.tex b/optimal_performance.tex index 93e86a9..3fb9bd6 100644 @@ -7,8 +7,8 @@ \vspace{-0.5em} \caption{\textbf{Optimal Performance Point for \text{Gobmk} Across Inefficiencies:} At low inefficiency budgets, the optimal frequency settings follow CPI of the -application, and select high memory frequencies for memory intensive phases with -high CPI. +application, and select high memory frequencies for memory intensive phases. % with +%high CPI. %to deliver best %performance under given inefficiency constraint. Higher inefficiency budgets @@ -62,7 +62,7 @@ and then memory frequency as this setting is bound to have highest performance among the other possibilities. Figure~\ref{gobmk-optimal} plots the optimal settings for $gobmk$ for all -benchmark samples (each of length 10 million instructions) across multiple +benchmark samples (each of length 10~M instructions) across multiple inefficiency constraints. At low inefficiencies, the optimal settings follow the trends in CPI (cycles per instruction) and MPKI (misses per thousand instructions). Regions of higher CPI correspond to memory intensive phases, as @@ -71,7 +71,7 @@ the SPEC benchmarks don't have any IO or interrupt based portions. %The higher the CPI is, the higher %the memory frequency of the optimal settings is (sample 7) to serve high memory %traffic. -For phases that are CPU intensive with (lower CPI), the optimal settings have +For phases that are CPU intensive (lower CPI), the optimal settings have higher CPU frequency and lower memory frequency. % (sample 9 and 10).
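The per-sample selection behind these optimal-settings plots can be sketched as a brute-force filter-then-minimize (an illustrative stand-in with invented (time, energy) numbers; `best_setting` is not the paper's tool):

```python
# For one sample: among all (cpu_MHz, mem_MHz) settings whose energy stays
# within the inefficiency budget (E / E_min <= budget), pick the fastest.
# All (time, energy) values below are hypothetical.

def best_setting(samples, inefficiency_budget):
    """samples: {(cpu_mhz, mem_mhz): (exec_time_s, energy_mj)}."""
    e_min = min(energy for _, energy in samples.values())
    feasible = [s for s, (_, e) in samples.items()
                if e / e_min <= inefficiency_budget]
    # Among settings within the budget, the fastest one is optimal.
    return min(feasible, key=lambda s: samples[s][0])

samples = {
    (1000, 800): (1.0, 24),  # fastest, but most energy-hungry
    (1000, 200): (1.2, 20),
    (600, 200):  (1.5, 16),  # the E_min setting
}
print(best_setting(samples, 1.3))  # (1000, 200): fastest within 1.3x E_min
print(best_setting(samples, 1.0))  # (600, 200): only the E_min setting qualifies
```

Loosening the budget admits faster, less efficient settings, which is why the optimal point shifts with the inefficiency constraint.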
At low inefficiency constraints, due to the limited energy budget, a careful allocation of energy across components becomes critical to achieve optimal @@ -92,8 +92,10 @@ There are two key problems associated with tracking the optimal settings: \noindent \textit{It is expensive.} Running the tuning algorithm at the end of every sample to track optimal settings comes at a cost: 1) searching and discovering the optimal settings, and 2) real hardware has transition latency -overhead for both the CPU and the memory frequency. -%transitions in . +overhead for both the CPU and the memory frequency. For example, while the search +algorithm presented by CoScale~\cite{deng2012coscale} takes 5us to find the optimal +frequency settings, the time taken by PLLs to change voltage and frequency in commercial processors is on the +order of tens of microseconds. Reducing the frequency at which tuning algorithms need to re-tune is critical to reduce the cost of tuning overhead on application performance. @@ -105,7 +107,7 @@ highest performance). For example, \textit{bzip2} is CPU bound and therefore its performance at memory frequency of 200MHz is within 3\% of performance at a memory frequency of 800MHz while the CPU is running at 1000MHz. By sacrificing that 3\% of performance, the system could have consumed 1/4 the memory background -energy staying well under the given inefficiency budget. +energy, saving 2.7\% of the system energy while staying well under the given inefficiency budget.
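The 2.7\% figure in the \textit{bzip2} example can be sanity-checked with a quick calculation; the 3.6\% memory-background share below is a hypothetical value back-solved from the quoted saving, not measured data:

```python
# Back-of-envelope check: cutting memory background energy to 1/4 saves
# background_share * (1 - 1/4) of system energy. The 3.6% share is a
# hypothetical number chosen to be consistent with the 2.7% saving in the text.

system_energy = 100.0        # arbitrary units
mem_background_share = 3.6   # hypothetical % of system energy
saved_pct = mem_background_share * (1 - 1 / 4)
print(f"{saved_pct:.1f}% of system energy saved")  # 2.7% of system energy saved
```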
%\end{enumerate} We believe that, if the user is willing to sacrifice some performance under diff --git a/performance_clusters.tex b/performance_clusters.tex index 21eb96a..0f65861 100644 --- a/performance_clusters.tex +++ b/performance_clusters.tex @@ -13,6 +13,7 @@ stable regions and vertical dashed lines mark the transitions made by \begin{figure*}[t] \begin{subfigure}[t]{\textwidth} \centering + \vspace{-1em} \includegraphics[width=\columnwidth]{{./figures/plots/496/stable_line_plots/stable_lineplot}.pdf} \end{subfigure}% \vspace{0.5em} @@ -242,8 +243,8 @@ in highest number of transitions. A common observation is that the number of transitions decreases with an increase in cluster threshold. For \textit{bzip2}, increase in inefficiency from 1.0 to 1.3 increases the number of transitions needed to track the optimal settings. The number of -available settings increase with inefficiency, giving more choices for the -optimal frequency settings. At an inefficiency budget of 1.6, the average length of a stable region +available settings increases with inefficiency, increasing the average length of +stable regions. At an inefficiency budget of 1.6, the average length of a stable region increases drastically as shown in Figure~\ref{box-lengths}(b), which requires far fewer transitions with 1\% cluster threshold and no transitions with higher cluster thresholds of 3\% and 5\%. Note that there is only one point on the box plot for 3\% and 5\% @@ -294,13 +295,12 @@ at optimal settings. Figures~\ref{energy-perf-trade-off}(a) and selecting the settings that don't degrade performance more than the specified cluster threshold. The figure also shows that with an increase in cluster threshold, energy consumption decreases because lower frequency settings can be -chosen at higher cluster thresholds. Figure~\ref{energy-perf-trade-off}(b) shows that -performance (and energy) may improve when tuning overhead is included due to decrease in -frequency transitions.
We assume tuning overhead -of 500us and 30uJ, which includes computing inefficiencies, searching for the -optimal setting and transition the hardware to new -settings~\cite{deng2012coscale}. We assumed that a space of 100 settings is -searched for every transition. +chosen at higher cluster thresholds. Figure~\ref{energy-perf-trade-off}(b) shows +that performance (and energy) may improve when tuning overhead is included due +to a decrease in frequency transitions. To determine the tuning overhead, we wrote a +simple algorithm to find optimal settings. With a search space of 70 frequency +settings, it resulted in a tuning overhead of 500us and 30uJ, which includes computing inefficiencies, searching for the optimal setting, +and transitioning the hardware to new settings. %This is not intuitive and we are investigating the cause of this anomaly %\XXXnote{MH: be careful I would cut this s%entance at a minimum and then find %the reason for the change}. diff --git a/system_methodology.tex b/system_methodology.tex index bc9e197..0ae1a41 100644 --- a/system_methodology.tex +++ b/system_methodology.tex @@ -64,6 +64,7 @@ implementation are shaded in Figure~\ref{fig-system-block-diag}. \includegraphics[width=\columnwidth,height=0.15\paperheight]{./figures/plots/496/speedup_inefficiency/heatmap_speedup.pdf} \label{heatmap-speedup} \end{subfigure}% +\vspace{-0.5em} \caption{\textbf{Inefficiency vs. Speedup For Multiple Applications:} In general, performance improves with increasing inefficiency budgets. A poorly designed algorithm may select bad frequency settings which could waste energy @@ -157,17 +158,21 @@ performance and energy data to study the impact of workload dynamics on the stability of CPU and memory frequency settings delivering best performance under a given inefficiency budget. Note that all our studies are performed using \textit{measured} performance and power data from the simulations; we do not \textit{predict} performance or energy.
- -Although individual energy-performance trade-offs of DVFS for CPU and -DFS for memory have been studied in the past, the trade-off resulting from -the cross-component interaction of these two components has not been -characterized. CoScale~\cite{deng2012coscale} did point out that -interplay of performance and energy consumption of these two -components is complex and did present a heuristic that attempts to -pick the optimal point. In the next Section, we measure and characterize +performance or energy. The interplay of performance and energy consumption of +CPU and memory frequency scaling is complex, as pointed out by +CoScale~\cite{deng2012coscale}. In the next section, we measure and characterize the larger space of all system-level performance and energy trade-offs of various CPU and memory frequency settings. + +%Although individual energy-performance trade-offs of DVFS for CPU and %DFS for memory have been studied in the past, the trade-off resulting from %the cross-component interaction of these two components has not been %characterized. CoScale~\cite{deng2012coscale} did point out that %interplay of performance and energy consumption of these two %components is complex and did present a heuristic that attempts to %pick the optimal point. In the next Section, we measure and characterize %the larger space of all system level performance and energy trade-offs %of various CPU and memory frequency settings. %In the next section, we study how performance and %inefficiency of applications varies with CPU and memory frequencies.
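Why the frequency of re-tuning matters can be illustrated with a small amortization sketch, using the 500us per-re-tune cost quoted earlier in the text; the sample length and transition counts below are invented:

```python
# Amortizing tuning overhead over stable regions: each re-tune costs a fixed
# amount of time (500us in the text, which also quotes 30uJ of energy), so
# longer stable regions / fewer transitions mean lower overhead.
# Sample length and transition counts are hypothetical.

TUNE_TIME_S = 500e-6  # compute inefficiencies + search + frequency transition

def time_overhead(num_samples, num_transitions, sample_time_s):
    """Fraction of total run time spent re-tuning."""
    return num_transitions * TUNE_TIME_S / (num_samples * sample_time_s)

# 1000 samples of 5ms each; a larger cluster threshold keeps settings stable
# longer, so fewer transitions are needed and the overhead shrinks.
print(f"{time_overhead(1000, 200, 5e-3):.1%}")  # 2.0% -- re-tune 1 sample in 5
print(f"{time_overhead(1000, 20, 5e-3):.1%}")   # 0.2% -- 10x fewer transitions
```

This is the mechanism behind the cluster-threshold results: higher thresholds lengthen stable regions, cut transitions, and shrink the overhead term.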