Commit 0e5dc2a4c5e8d9723ccafbe755ff77c3269835e0
1 parent 44d949a6
+ dave's comments
Showing 6 changed files with 26 additions and 24 deletions
inefficiency.tex
@@ -180,13 +180,13 @@ consumption.
 %CPU2006 benchmarks with highest error of 10\% except for $gobmk (18\%)$ and $lbm
 %(24\%)$.
 %%%%% END OF MODEL %%%%%%
-In this work we demonstrate how to use inefficiency, deferring predicting and
+In this work we demonstrate how to use inefficiency and defer both predicting and
 optimizing $E_{min}$ to future work.

 \subsection{Managing Inefficiency}
 %
 Future energy management algorithms need to tune system settings to keep the
-system within specified inefficiency budget and deliver the best performance.
+system within the specified inefficiency budget and deliver the best performance.
 %
 Techniques that use predictors such as instructions-per-cycle (IPC) to decide
 when to use DVFS or migrate threads can be extended to operate under given
@@ -202,7 +202,7 @@ under performance constraints, some have the potential to be modified to work
 under energy constraints and thus could operate under
 inefficiency budget~\cite{bitirgen2008coordinated,deng2012coscale,chen2011coordinating,fan2005synergy,felter2005performance,li2007cross,raghavendra2008no}.
 %
-We leave building some of these algorithms into a system as future work.
+We leave incorporating some of these algorithms into a system as future work.
 %
 In this paper, we characterize the optimal performance point under different
 inefficiency constraints and illustrate that the stability of these points
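A minimal sketch of the budget-constrained tuning these hunks describe, assuming inefficiency is measured as energy relative to $E_{min}$ and that per-setting energy and speedup estimates are available; the function and data names below are hypothetical and not part of the paper's system.

# Hypothetical sketch: pick the best-performing (cpu_freq, mem_freq) setting
# that stays within a given inefficiency budget. `estimates` maps each
# frequency pair to predicted (energy, speedup); e_min is the minimum energy
# across all settings. Names and structure are illustrative only.

def best_setting(estimates, e_min, budget):
    feasible = {
        setting: (energy, speedup)
        for setting, (energy, speedup) in estimates.items()
        if energy / e_min <= budget          # stay within the inefficiency budget
    }
    if not feasible:
        return None
    # Among feasible settings, deliver the best performance (highest speedup).
    return max(feasible, key=lambda s: feasible[s][1])

# Example with made-up numbers: (cpu MHz, mem MHz) -> (energy in J, speedup)
estimates = {
    (1000, 800): (1.20, 1.00),
    (1000, 200): (1.05, 0.97),
    (600, 400):  (1.00, 0.70),
}
print(best_setting(estimates, e_min=1.00, budget=1.1))   # -> (1000, 200)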
inefficiency_speedup.tex
@@ -27,7 +27,7 @@ Figure~\ref{heatmaps} plots the speedup and inefficiency for three workloads
 operating with various CPU and memory frequencies. As the figure shows, the
 ability of a workload to trade-off energy and performance using CPU and memory
 frequency, depends on its mix of CPU and memory instructions. For CPU intensive
-workloads like \textit{bzip2}, speedup varies with only CPU frequency, and
+workloads like \textit{bzip2}, speedup varies only with CPU frequency;
 memory frequency has no impact on speedup. For workloads that have balanced CPU
 and memory intensive phases like \textit{gobmk}, speedup varies with both CPU
 and memory frequency. The \textit{milc} benchmark has some memory intensive
@@ -44,7 +44,7 @@ We make three major observations:
 efficiently.} At the lowest frequencies, 100MHz and 200MHz for CPU and
 memory respectively, \textit{gobmk} takes the longest to execute. These settings slow down the application so much
 that its overall energy consumption increases, thereby resulting in
-inefficiency of 1.55 for \textit{gobmk}. Algorithms that choose these frequency settings spend
+inefficiency of 1.55. Algorithms that choose these frequency settings spend
 55\% more energy without any performance improvement.
 %The converse is also true
 %as noted by our second observation.
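A worked reading of the 1.55 figure, assuming the paper's inefficiency metric is total energy relative to the minimum achievable energy $E_{min}$ (inferred from the surrounding text, not quoted from the paper's definition):

\[
  \text{inefficiency} = \frac{E}{E_{min}} = 1.55
  \;\Longrightarrow\;
  E = 1.55\,E_{min},
\]

i.e., 55\% more energy than the minimum, with no performance gain at these lowest frequency settings.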
introduction.tex
@@ -19,10 +19,10 @@ Still other hardware energy-performance tradeoffs are on the horizon, arising
 from capabilities such as memory frequency scaling~\cite{david2011memory} and nanosecond-speed DVFS
 emerging in next-generation hardware designs~\cite{6084810}.

-We envision a next-generation smartphone capable of scaling both voltage and
-frequency of CPU and only frequency of memory.
+We envision a next-generation smartphone capable of CPU DVFS (Dynamic Voltage
+and Frequency Scaling) and memory DFS (Dynamic Frequency Scaling).
 %
-While the addition of memory DVFS can be used to improve energy-constrained
+While the addition of memory DFS can be used to improve energy-constrained
 performance, the larger frequency state space compared to CPU DVFS alone also
 provides more incorrect settings that waste energy or degrade performance.
 %
@@ -33,7 +33,7 @@ energy constraints.
 Our work represents two advances over previous efforts.
 %
 First, while previous works have explored energy minimizations using DVFS
-under performance constraints focusing on reducing slack~\cite{deng2012coscale}, we are the first to
+under performance constraints focusing on reducing slack, we are the first to
 study the potential DVFS settings under an energy constraint.
 %
 Specifying performance constraints for servers is appropriate, since they are
@@ -75,7 +75,7 @@ performance.
 %
 \item We study the energy-performance trade-offs of systems that are capable
 of both CPU and memory DVFS for multiple applications. We show that poor
-frequency selection can both hurt performance and energy consumption.
+frequency selection can hurt both performance and energy consumption.
 %
 \item We characterize the optimal frequency settings for multiple
 applications and inefficiency budgets. We introduce \textit{performance
@@ -87,10 +87,10 @@ management algorithms.
 %
 \end{enumerate}

-We use the \texttt{Gem5} simulator, the Android smartphone platform and Linux
+We use the \texttt{gem5} simulator, the Android smartphone platform and Linux
 kernel, and an empirical power model to (1) measure the inefficiency of
 several applications for a wide range of frequency settings, (2) compute
-performance clusters, and (3) study how they evolve.
+performance clusters, and (3) study how performance clusters evolve.
 %
 We are currently constructing a complete system to study tuning algorithms
 that can build on our insights to adaptively choose frequency settings at
optimal_performance.tex
@@ -107,7 +107,7 @@ highest performance). For example, \textit{bzip2} is CPU bound and therefore
 its performance at memory frequency of 200MHz is within 3\% of performance at a
 memory frequency of 800MHz while the CPU is running at 1000MHz. By sacrificing that
 3\% of performance, the system could have consumed 1/4 the memory background
-energy saving 2.7\% of the system energy and staying well under the given inefficiency budget.
+energy, saving 2.7\% of the system energy and staying well under the given inefficiency budget.
 %\end{enumerate}

 We believe that, if the user is willing to sacrifice some performance under
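A rough reading of the arithmetic implied by the 2.7\% figure; the 3.6\% share below is back-calculated from the numbers in this hunk, not a measurement reported in the paper:

\[
  \Delta E \approx \left(1 - \tfrac{1}{4}\right) b = 0.75\,b = 2.7\%
  \;\Longrightarrow\;
  b \approx 3.6\%,
\]

where $b$ denotes memory background energy as a fraction of system energy at the 800MHz memory setting.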
performance_clusters.tex
@@ -48,7 +48,7 @@ the system.

 \subsection{Performance Clusters}
 We search for the performance clusters using an algorithm that is similar to the approach we used to find the optimal settings. We
-first filter the settings that fall within a given inefficiency budget, and
+first filter the settings that fall within a given inefficiency budget and
 then search for the optimal settings in the first pass. In the second pass, we find all of the
 settings that have a speedup within the specified \textit{cluster threshold} of the optimal performance.

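A minimal sketch of the two-pass cluster search described in this hunk, assuming per-setting speedup and inefficiency values are available; the names are illustrative, not the paper's implementation.

# Hypothetical sketch of the two-pass search: pass 1 keeps settings within the
# inefficiency budget and finds the optimal (fastest) one; pass 2 collects all
# budget-respecting settings whose speedup is within `cluster_threshold` of
# the optimum. `settings` maps (cpu_freq, mem_freq) -> (speedup, inefficiency).

def performance_cluster(settings, budget, cluster_threshold):
    # Pass 1: filter by inefficiency budget, then pick the best speedup.
    within_budget = {s: v for s, v in settings.items() if v[1] <= budget}
    if not within_budget:
        return set()
    optimal_speedup = max(v[0] for v in within_budget.values())

    # Pass 2: keep every setting within the threshold of the optimum.
    return {
        s for s, (speedup, _) in within_budget.items()
        if speedup >= optimal_speedup * (1 - cluster_threshold)
    }

# Example with made-up numbers and a 5% (0.05) cluster threshold.
settings = {
    (1000, 800): (1.00, 1.20),
    (1000, 400): (0.99, 1.05),
    (800, 400):  (0.90, 1.00),
}
print(performance_cluster(settings, budget=1.1, cluster_threshold=0.05))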
@@ -95,7 +95,7 @@ compromising performance by setting low inefficiency budgets to save energy.

 Figures~\ref{clusters-gobmk}(c),~\ref{clusters-gobmk}(d) plot the
 performance clusters for \textit{gobmk} for inefficiency budget of 1.3 and
-cluster thresholds of 1\% and 5\% respectively. As we saw in
+cluster thresholds of 1\% and 5\% respectively. As we observed in
 Figure~\ref{gobmk-optimal}, the optimal settings for \textit{gobmk} change
 every sample (of length 10 million instructions) and follows
 application phases (CPI). Figure~\ref{clusters-gobmk}(c) shows that by
@@ -118,7 +118,8 @@ Figures~\ref{clusters-gobmk}(a),~\ref{clusters-gobmk}(c) plot the performance
 clusters for \textit{gobmk} for two different inefficiency budgets of 1.0 and 1.3 for
 cluster threshold of 1\%.
 %\XXXnote{reword next sentence? -Dave}
-Not all of the stable regions increase in length with increasing inefficiency but instead depends on the workload.
+Not all of the stable regions increase in length with increasing inefficiency;
+this trend varies with workloads.
 %Increase in the length of stable regions with increase in
 %inefficiency is a
 %function of workload characteristics.
@@ -344,8 +345,8 @@ runs at one setting, sample 8-9 runs at another setting and sample 10 runs at a
 different setting due to the availability of more (and better) choices.
 %\XXXnote{sounds wordy -Dave}.
 In our system, we observed only a small improvement in performance (\textless
-1\%) with higher number of frequency steps when
-tuning is free as optimal
+1\%) with an increased number of frequency steps when
+tuning is free, as optimal
 settings in both cases were off by only a few MHz. It is the balance between the
 tuning overhead and the energy-performance savings that is
 critical in deciding the correct size of the search space.
system_methodology.tex
@@ -2,7 +2,7 @@
 \label{sec-sys-methodology}
 Energy management algorithms must tune the underlying hardware components to
 keep the system within the given inefficiency budget. Hardware components
-provide multiple knobs that can be tuned to trade-off performance for energy
+provide multiple "knobs" that can be tuned to trade-off performance for energy
 savings. For example, the energy consumed by the CPU can be managed by tuning
 its frequency and voltage.
 %DRAM energy can be
@@ -121,11 +121,12 @@ being 1.25V.
 %0.02V/30MHz. The voltage and frequency pairs match with the frequency steps
 %used by the Nexus S.

-For the memory system, we simulated a LPDDR3 single channel, one rank memory access using an open-page
-policy. Timing and current parameters for LPDDR3 are configured as specified in
-data sheets from Micron~\cite{micronspec-url}. Memory clock domain is configured with a
-frequency range of 200MHz to 800MHz. As mentioned earlier, we did not scale memory
-voltage. The power supplies---VDD and VDD2---for LPDDR3 are fixed at 1.8V and 1.2V respectively.
+For the memory system, we simulated a LPDDR3 single channel, one rank memory
+using an open-page access policy. Timing and current parameters for LPDDR3 are
+configured as specified in data sheets from Micron~\cite{micronspec-url}. Memory
+clock domain is configured with a frequency range of 200MHz to 800MHz. As
+mentioned earlier, we did not scale memory voltage. The power supplies---VDD and
+VDD2---for LPDDR3 are fixed at 1.8V and 1.2V respectively.

 We first simulated 12 integer and 9 floating point SPEC CPU2006
 benchmarks~\cite{henning2006spec}, with each benchmark either running to
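A hypothetical gem5-style sketch of the memory configuration described in the hunk above; the class and parameter names follow recent gem5 releases and may differ from the simulator version used in this work.

# Hypothetical gem5 configuration sketch (names are assumptions, not the
# paper's actual scripts): LPDDR3, single channel, one rank, open-page policy,
# Micron-derived timing/current parameters, frequency swept without voltage scaling.
from m5.objects import MemCtrl, LPDDR3_1600_1x32, SrcClockDomain, VoltageDomain

mem_ctrl = MemCtrl()
mem_ctrl.dram = LPDDR3_1600_1x32()        # Micron-style LPDDR3 timing/current parameters
mem_ctrl.dram.page_policy = 'open'        # open-page access policy
mem_ctrl.dram.ranks_per_channel = 1       # single channel, one rank

# Memory sits in its own clock domain so its frequency can be swept between
# 200MHz and 800MHz; VDD/VDD2 stay fixed, so no memory voltage scaling is modeled.
mem_ctrl.clk_domain = SrcClockDomain(clock='800MHz',
                                     voltage_domain=VoltageDomain())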