diff --git a/inefficiency.tex b/inefficiency.tex
index 733055b..c3f8dfc 100644
--- a/inefficiency.tex
+++ b/inefficiency.tex
@@ -180,13 +180,13 @@ consumption.
 %CPU2006 benchmarks with highest error of 10\% except for $gobmk (18\%)$ and $lbm
 %(24\%)$.
 %%%%% END OF MODEL %%%%%%
-In this work we demonstrate how to use inefficiency, deferring predicting and
+In this work we demonstrate how to use inefficiency and defer both predicting and
 optimizing $E_{min}$ to future work.
 \subsection{Managing Inefficiency}
 %
 Future energy management algorithms need to tune system settings to keep the
-system within specified inefficiency budget and deliver the best performance.
+system within the specified inefficiency budget and deliver the best performance.
 %
 Techniques that use predictors such as instructions-per-cycle (IPC) to decide
 when to use DVFS or migrate threads can be extended to operate under given
@@ -202,7 +202,7 @@ under performance constraints, some have the potential to be modified to work
 under energy constraints and thus could operate under an inefficiency
 budget~\cite{bitirgen2008coordinated,deng2012coscale,chen2011coordinating,fan2005synergy,felter2005performance,li2007cross,raghavendra2008no}.
 %
-We leave building some of these algorithms into a system as future work.
+We leave incorporating some of these algorithms into a system as future work.
 %
 In this paper, we characterize the optimal performance point under different
 inefficiency constraints and illustrate that the stability of these points
diff --git a/inefficiency_speedup.tex b/inefficiency_speedup.tex
index c895c30..864cfe6 100644
@@ -27,7 +27,7 @@
 Figure~\ref{heatmaps} plots the speedup and inefficiency for three workloads
 operating with various CPU and memory frequencies. As the figure shows, the
 ability of a workload to trade off energy and performance using CPU and memory
 frequency depends on its mix of CPU and memory instructions.
 For CPU intensive
-workloads like \textit{bzip2}, speedup varies with only CPU frequency, and
+workloads like \textit{bzip2}, speedup varies only with CPU frequency;
 memory frequency has no impact on speedup. For workloads that have balanced
 CPU and memory intensive phases like \textit{gobmk}, speedup varies with both
 CPU and memory frequency. The \textit{milc} benchmark has some memory intensive
@@ -44,7 +44,7 @@ We make three major observations:
 efficiently.} At the lowest frequencies, 100MHz and 200MHz for CPU and memory,
 respectively, \textit{gobmk} takes the longest to execute. These settings
 slow down the application so much that its overall energy consumption
 increases, thereby resulting in
-inefficiency of 1.55 for \textit{gobmk}. Algorithms that choose these frequency settings spend
+an inefficiency of 1.55. Algorithms that choose these frequency settings spend
 55\% more energy without any performance improvement.
 %The converse is also true
 %as noted by our second observation.
diff --git a/introduction.tex b/introduction.tex
index 6805359..b3cce6f 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -19,10 +19,10 @@
 Still other hardware energy-performance tradeoffs are on the horizon, arising
 from capabilities such as memory frequency scaling~\cite{david2011memory} and
 nanosecond-speed DVFS emerging in next-generation hardware
 designs~\cite{6084810}.
-We envision a next-generation smartphone capable of scaling both voltage and
-frequency of CPU and only frequency of memory.
+We envision a next-generation smartphone capable of CPU DVFS (Dynamic Voltage
+and Frequency Scaling) and memory DFS (Dynamic Frequency Scaling).
 %
-While the addition of memory DVFS can be used to improve energy-constrained
+While the addition of memory DFS can be used to improve energy-constrained
 performance, the larger frequency state space compared to CPU DVFS alone also
 provides more incorrect settings that waste energy or degrade performance.
 %
@@ -33,7 +33,7 @@ energy constraints.
 Our work represents two advances over previous efforts.
 %
 First, while previous works have explored energy minimizations using DVFS
-under performance constraints focusing on reducing slack~\cite{deng2012coscale}, we are the first to
+under performance constraints, focusing on reducing slack, we are the first to
 study the potential DVFS settings under an energy constraint.
 %
 Specifying performance constraints for servers is appropriate, since they are
@@ -75,7 +75,7 @@ performance.
 %
 \item We study the energy-performance trade-offs of systems that are capable
 of both CPU and memory DVFS for multiple applications. We show that poor
-frequency selection can both hurt performance and energy consumption.
+frequency selection can hurt both performance and energy consumption.
 %
 \item We characterize the optimal frequency settings for multiple
 applications and inefficiency budgets. We introduce \textit{performance
@@ -87,10 +87,10 @@ management algorithms.
 %
 \end{enumerate}
-We use the \texttt{Gem5} simulator, the Android smartphone platform and Linux
+We use the \texttt{gem5} simulator, the Android smartphone platform and Linux
 kernel, and an empirical power model to (1) measure the inefficiency of
 several applications for a wide range of frequency settings, (2) compute
-performance clusters, and (3) study how they evolve.
+performance clusters, and (3) study how performance clusters evolve.
 %
 We are currently constructing a complete system to study tuning algorithms
 that can build on our insights to adaptively choose frequency settings at
diff --git a/optimal_performance.tex b/optimal_performance.tex
index 3fb9bd6..725351d 100644
@@ -107,7 +107,7 @@ highest performance).
 For example, \textit{bzip2} is CPU-bound and therefore its
 performance at memory frequency of 200MHz is within 3\% of performance at a
 memory frequency of 800MHz while the CPU is running at 1000MHz.
 By sacrificing that 3\% of performance, the system could have consumed 1/4 the memory background
-energy saving 2.7\% of the system energy and staying well under the given inefficiency budget.
+energy, saving 2.7\% of the system energy and staying well under the given inefficiency budget.
 %\end{enumerate}
 We believe that, if the user is willing to sacrifice some performance under
diff --git a/performance_clusters.tex b/performance_clusters.tex
index 04a8a90..ff09118 100644
@@ -48,7 +48,7 @@ the system.
 \subsection{Performance Clusters}
 We search for the performance clusters using an algorithm that is similar
 to the approach we used to find the optimal settings. We
-first filter the settings that fall within a given inefficiency budget, and
+first filter the settings that fall within a given inefficiency budget and
 then search for the optimal settings in the first pass. In the second pass,
 we find all of the settings that have a speedup within the specified
 \textit{cluster threshold} of the optimal performance.
@@ -95,7 +95,7 @@ compromising performance by setting low inefficiency budgets to save energy.
 Figures~\ref{clusters-gobmk}(c),~\ref{clusters-gobmk}(d) plot the
 performance clusters for \textit{gobmk} for an inefficiency budget of 1.3 and
-cluster thresholds of 1\% and 5\% respectively. As we saw in
+cluster thresholds of 1\% and 5\%, respectively. As we observed in
 Figure~\ref{gobmk-optimal}, the optimal settings for \textit{gobmk}
 change every sample (of length 10 million instructions) and follow
 application phases (CPI). Figure~\ref{clusters-gobmk}(c) shows that by
@@ -118,7 +118,8 @@ Figures~\ref{clusters-gobmk}(a),~\ref{clusters-gobmk}(c) plot the
 performance clusters for \textit{gobmk} for two different inefficiency
 budgets of 1.0 and 1.3 for a cluster threshold of 1\%.
 %\XXXnote{reword next sentence?
-Dave}
-Not all of the stable regions increase in length with increasing inefficiency but instead depends on the workload.
+Not all of the stable regions increase in length with increasing inefficiency;
+this trend varies across workloads.
 %Increase in the length of stable regions with increase in %inefficiency is a
 %function of workload characteristics.
@@ -344,8 +345,8 @@ runs at one setting, sample 8-9 runs at another setting and sample 10 runs at
 a different setting due to the availability of more (and better) choices.
 %\XXXnote{sounds wordy -Dave}.
 In our system, we observed only a small improvement in performance (\textless
-1\%) with higher number of frequency steps when
-tuning is free as optimal
+1\%) with an increased number of frequency steps when
+tuning is free, as optimal
 settings in both cases were off by only a few MHz. It is the balance between
 the tuning overhead and the energy-performance savings that is critical in
 deciding the correct size of the search space.
diff --git a/system_methodology.tex b/system_methodology.tex
index 0ae1a41..6f16544 100644
@@ -2,7 +2,7 @@ \label{sec-sys-methodology}
 Energy management algorithms must tune the underlying hardware components to
 keep the system within the given inefficiency budget. Hardware components
-provide multiple knobs that can be tuned to trade-off performance for energy
+provide multiple ``knobs'' that can be tuned to trade off performance for energy
 savings. For example, the energy consumed by the CPU can be managed by tuning
 its frequency and voltage.
 %DRAM energy can be
@@ -121,11 +121,12 @@ being 1.25V.
 %0.02V/30MHz. The voltage and frequency pairs match with the frequency steps
 %used by the Nexus S.
-For the memory system, we simulated a LPDDR3 single channel, one rank memory access using an open-page
-policy. Timing and current parameters for LPDDR3 are configured as specified in
-data sheets from Micron~\cite{micronspec-url}.
Memory clock domain is configured with a
-frequency range of 200MHz to 800MHz. As mentioned earlier, we did not scale memory
-voltage. The power supplies---VDD and VDD2---for LPDDR3 are fixed at 1.8V and 1.2V respectively.
+For the memory system, we simulated an LPDDR3 single-channel, one-rank memory
+using an open-page access policy. Timing and current parameters for LPDDR3 are
+configured as specified in data sheets from Micron~\cite{micronspec-url}. The
+memory clock domain is configured with a frequency range of 200MHz to 800MHz. As
+mentioned earlier, we did not scale memory voltage. The power supplies---VDD and
+VDD2---for LPDDR3 are fixed at 1.8V and 1.2V, respectively.
 We first simulated 12 integer and 9 floating point SPEC
 CPU2006 benchmarks~\cite{henning2006spec}, with each benchmark either running to
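The two-pass cluster search edited in the performance_clusters.tex hunk above (first filter settings within the inefficiency budget and find the optimum, then collect all settings whose speedup is within the cluster threshold of that optimum) is compact enough to sketch. The Python sketch below is illustrative only: the `Setting` structure and the example numbers are assumptions, not the paper's implementation or data; inefficiency is taken, as in the text, to be energy relative to the minimum energy $E_{min}$.

```python
# Hypothetical sketch of the two-pass performance-cluster search.
# A "setting" is a (CPU MHz, memory MHz) pair annotated with a measured
# speedup and an inefficiency (energy divided by the minimum energy E_min).
from dataclasses import dataclass

@dataclass(frozen=True)
class Setting:
    cpu_mhz: int
    mem_mhz: int
    speedup: float        # performance relative to some baseline setting
    inefficiency: float   # energy relative to E_min (1.0 = minimum energy)

def performance_cluster(settings, budget, cluster_threshold):
    """Return every setting whose speedup is within `cluster_threshold`
    (a fraction, e.g. 0.05 for 5%) of the best speedup achievable
    within the given inefficiency `budget`."""
    # Pass 1: keep only settings within the inefficiency budget and
    # find the optimal (highest-speedup) setting among them.
    feasible = [s for s in settings if s.inefficiency <= budget]
    if not feasible:
        return []
    best = max(feasible, key=lambda s: s.speedup)
    # Pass 2: collect the feasible settings whose speedup is within the
    # cluster threshold of the optimum.
    return [s for s in feasible
            if s.speedup >= best.speedup * (1.0 - cluster_threshold)]

# Made-up settings for one sample; the numbers are not from the paper.
settings = [
    Setting(1000, 800, speedup=1.00, inefficiency=1.20),
    Setting(1000, 200, speedup=0.97, inefficiency=1.05),
    Setting(500, 400, speedup=0.60, inefficiency=1.00),
    Setting(100, 200, speedup=0.30, inefficiency=1.55),
]
cluster = performance_cluster(settings, budget=1.3, cluster_threshold=0.05)
# cluster holds the 1000/800 and 1000/200 settings: both fit the budget,
# and 0.97 is within 5% of the best feasible speedup of 1.00.
```

A tuning algorithm could then pick any member of the cluster, e.g. the one with the lowest memory frequency, trading a little performance for energy as in the \textit{bzip2} example above.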