Commit 0e5dc2a4c5e8d9723ccafbe755ff77c3269835e0

Authored by Rizwana Begum
1 parent 44d949a6

+ dave's comments

inefficiency.tex
@@ -180,13 +180,13 @@ consumption. @@ -180,13 +180,13 @@ consumption.
180 %CPU2006 benchmarks with highest error of 10\% except for $gobmk (18\%)$ and $lbm 180 %CPU2006 benchmarks with highest error of 10\% except for $gobmk (18\%)$ and $lbm
181 %(24\%)$. 181 %(24\%)$.
182 %%%%% END OF MODEL %%%%%% 182 %%%%% END OF MODEL %%%%%%
183 -In this work we demonstrate how to use inefficiency, deferring predicting and 183 +In this work we demonstrate how to use inefficiency and defer both predicting and
184 optimizing $E_{min}$ to future work. 184 optimizing $E_{min}$ to future work.
185 185
186 \subsection{Managing Inefficiency} 186 \subsection{Managing Inefficiency}
187 % 187 %
188 Future energy management algorithms need to tune system settings to keep the 188 Future energy management algorithms need to tune system settings to keep the
189 -system within specified inefficiency budget and deliver the best performance. 189 +system within the specified inefficiency budget and deliver the best performance.
190 % 190 %
191 Techniques that use predictors such as instructions-per-cycle (IPC) to decide 191 Techniques that use predictors such as instructions-per-cycle (IPC) to decide
192 when to use DVFS or migrate threads can be extended to operate under given 192 when to use DVFS or migrate threads can be extended to operate under given
@@ -202,7 +202,7 @@ under performance constraints, some have the potential to be modified to work @@ -202,7 +202,7 @@ under performance constraints, some have the potential to be modified to work
202 under energy constraints and thus could operate under 202 under energy constraints and thus could operate under
203 inefficiency budget~\cite{bitirgen2008coordinated,deng2012coscale,chen2011coordinating,fan2005synergy,felter2005performance,li2007cross,raghavendra2008no}. 203 inefficiency budget~\cite{bitirgen2008coordinated,deng2012coscale,chen2011coordinating,fan2005synergy,felter2005performance,li2007cross,raghavendra2008no}.
204 % 204 %
205 -We leave building some of these algorithms into a system as future work. 205 +We leave incorporating some of these algorithms into a system as future work.
206 % 206 %
207 In this paper, we characterize the optimal performance point under different 207 In this paper, we characterize the optimal performance point under different
208 inefficiency constraints and illustrate that the stability of these points 208 inefficiency constraints and illustrate that the stability of these points
inefficiency_speedup.tex
@@ -27,7 +27,7 @@ Figure~\ref{heatmaps} plots the speedup and inefficiency for three workloads @@ -27,7 +27,7 @@ Figure~\ref{heatmaps} plots the speedup and inefficiency for three workloads
27 operating with various CPU and memory frequencies. As the figure shows, the 27 operating with various CPU and memory frequencies. As the figure shows, the
28 ability of a workload to trade-off energy and performance using CPU and memory 28 ability of a workload to trade-off energy and performance using CPU and memory
29 frequency, depends on its mix of CPU and memory instructions. For CPU intensive 29 frequency, depends on its mix of CPU and memory instructions. For CPU intensive
30 -workloads like \textit{bzip2}, speedup varies with only CPU frequency, and 30 +workloads like \textit{bzip2}, speedup varies only with CPU frequency;
31 memory frequency has no impact on speedup. For workloads that have balanced CPU 31 memory frequency has no impact on speedup. For workloads that have balanced CPU
32 and memory intensive phases like \textit{gobmk}, speedup varies with both CPU 32 and memory intensive phases like \textit{gobmk}, speedup varies with both CPU
33 and memory frequency. The \textit{milc} benchmark has some memory intensive 33 and memory frequency. The \textit{milc} benchmark has some memory intensive
@@ -44,7 +44,7 @@ We make three major observations: @@ -44,7 +44,7 @@ We make three major observations:
44 efficiently.} At the lowest frequencies, 100MHz and 200MHz for CPU and 44 efficiently.} At the lowest frequencies, 100MHz and 200MHz for CPU and
45 memory respectively, \textit{gobmk} takes the longest to execute. These settings slow down the application so much 45 memory respectively, \textit{gobmk} takes the longest to execute. These settings slow down the application so much
46 that its overall energy consumption increases, thereby resulting in 46 that its overall energy consumption increases, thereby resulting in
47 -inefficiency of 1.55 for \textit{gobmk}. Algorithms that choose these frequency settings spend 47 +inefficiency of 1.55. Algorithms that choose these frequency settings spend
48 55\% more energy without any performance improvement. 48 55\% more energy without any performance improvement.
49 %The converse is also true 49 %The converse is also true
50 %as noted by our second observation. 50 %as noted by our second observation.
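Editor's note on the 1.55 figure above: the arithmetic behind "55\% more energy" can be written out as follows. This is a sketch that assumes inefficiency is the ratio of the energy actually consumed, $E$, to the minimum energy $E_{min}$ the workload could have consumed. The excerpt names $E_{min}$ but does not show the definition, so the symbol $I$ and the ratio form are assumptions.

```latex
% Assumed definition (not shown in this excerpt): I = E / E_min
\[
  I = \frac{E}{E_{min}}, \qquad
  I = 1.55 \;\Rightarrow\; \frac{E - E_{min}}{E_{min}} = I - 1 = 0.55
\]
% That is, the run consumes 55\% more energy than the minimum.
```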
introduction.tex
@@ -19,10 +19,10 @@ Still other hardware energy-performance tradeoffs are on the horizon, arising @@ -19,10 +19,10 @@ Still other hardware energy-performance tradeoffs are on the horizon, arising
19 from capabilities such as memory frequency scaling~\cite{david2011memory} and nanosecond-speed DVFS 19 from capabilities such as memory frequency scaling~\cite{david2011memory} and nanosecond-speed DVFS
20 emerging in next-generation hardware designs~\cite{6084810}. 20 emerging in next-generation hardware designs~\cite{6084810}.
21 21
22 -We envision a next-generation smartphone capable of scaling both voltage and  
23 -frequency of CPU and only frequency of memory. 22 +We envision a next-generation smartphone capable of CPU DVFS (Dynamic Voltage
  23 +and Frequency Scaling) and memory DFS (Dynamic Frequency Scaling).
24 % 24 %
25 -While the addition of memory DVFS can be used to improve energy-constrained 25 +While the addition of memory DFS can be used to improve energy-constrained
26 performance, the larger frequency state space compared to CPU DVFS alone also 26 performance, the larger frequency state space compared to CPU DVFS alone also
27 provides more incorrect settings that waste energy or degrade performance. 27 provides more incorrect settings that waste energy or degrade performance.
28 % 28 %
@@ -33,7 +33,7 @@ energy constraints. @@ -33,7 +33,7 @@ energy constraints.
33 Our work represents two advances over previous efforts. 33 Our work represents two advances over previous efforts.
34 % 34 %
35 First, while previous works have explored energy minimizations using DVFS 35 First, while previous works have explored energy minimizations using DVFS
36 -under performance constraints focusing on reducing slack~\cite{deng2012coscale}, we are the first to 36 +under performance constraints focusing on reducing slack, we are the first to
37 study the potential DVFS settings under an energy constraint. 37 study the potential DVFS settings under an energy constraint.
38 % 38 %
39 Specifying performance constraints for servers is appropriate, since they are 39 Specifying performance constraints for servers is appropriate, since they are
@@ -75,7 +75,7 @@ performance. @@ -75,7 +75,7 @@ performance.
75 % 75 %
76 \item We study the energy-performance trade-offs of systems that are capable 76 \item We study the energy-performance trade-offs of systems that are capable
77 of both CPU and memory DVFS for multiple applications. We show that poor 77 of both CPU and memory DVFS for multiple applications. We show that poor
78 -frequency selection can both hurt performance and energy consumption. 78 +frequency selection can hurt both performance and energy consumption.
79 % 79 %
80 \item We characterize the optimal frequency settings for multiple 80 \item We characterize the optimal frequency settings for multiple
81 applications and inefficiency budgets. We introduce \textit{performance 81 applications and inefficiency budgets. We introduce \textit{performance
@@ -87,10 +87,10 @@ management algorithms. @@ -87,10 +87,10 @@ management algorithms.
87 % 87 %
88 \end{enumerate} 88 \end{enumerate}
89 89
90 -We use the \texttt{Gem5} simulator, the Android smartphone platform and Linux 90 +We use the \texttt{gem5} simulator, the Android smartphone platform and Linux
91 kernel, and an empirical power model to (1) measure the inefficiency of 91 kernel, and an empirical power model to (1) measure the inefficiency of
92 several applications for a wide range of frequency settings, (2) compute 92 several applications for a wide range of frequency settings, (2) compute
93 -performance clusters, and (3) study how they evolve. 93 +performance clusters, and (3) study how performance clusters evolve.
94 % 94 %
95 We are currently constructing a complete system to study tuning algorithms 95 We are currently constructing a complete system to study tuning algorithms
96 that can build on our insights to adaptively choose frequency settings at 96 that can build on our insights to adaptively choose frequency settings at
optimal_performance.tex
@@ -107,7 +107,7 @@ highest performance). For example, \textit{bzip2} is CPU bound and therefore @@ -107,7 +107,7 @@ highest performance). For example, \textit{bzip2} is CPU bound and therefore
107 its performance at memory frequency of 200MHz is within 3\% of performance at a 107 its performance at memory frequency of 200MHz is within 3\% of performance at a
108 memory frequency of 800MHz while the CPU is running at 1000MHz. By sacrificing that 108 memory frequency of 800MHz while the CPU is running at 1000MHz. By sacrificing that
109 3\% of performance, the system could have consumed 1/4 the memory background 109 3\% of performance, the system could have consumed 1/4 the memory background
110 -energy saving 2.7\% of the system energy and staying well under the given inefficiency budget. 110 +energy, saving 2.7\% of the system energy and staying well under the given inefficiency budget.
111 %\end{enumerate} 111 %\end{enumerate}
112 112
113 We believe that, if the user is willing to sacrifice some performance under 113 We believe that, if the user is willing to sacrifice some performance under
performance_clusters.tex
@@ -48,7 +48,7 @@ the system. @@ -48,7 +48,7 @@ the system.
48 48
49 \subsection{Performance Clusters} 49 \subsection{Performance Clusters}
50 We search for the performance clusters using an algorithm that is similar to the approach we used to find the optimal settings. We 50 We search for the performance clusters using an algorithm that is similar to the approach we used to find the optimal settings. We
51 -first filter the settings that fall within a given inefficiency budget, and 51 +first filter the settings that fall within a given inefficiency budget and
52 then search for the optimal settings in the first pass. In the second pass, we find all of the 52 then search for the optimal settings in the first pass. In the second pass, we find all of the
53 settings that have a speedup within the specified \textit{cluster threshold} of the optimal performance. 53 settings that have a speedup within the specified \textit{cluster threshold} of the optimal performance.
54 54
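Editor's note: the two-pass search described in this hunk might be sketched as below. This is a minimal illustration, not the authors' implementation. The `Setting` record, the function name, and the example numbers are hypothetical, and "within the cluster threshold" is assumed to mean a relative bound, i.e. speedup of at least (1 - threshold) times the optimal speedup.

```python
from typing import List, NamedTuple, Optional, Tuple


class Setting(NamedTuple):
    """One CPU/memory frequency pair with its measured metrics (hypothetical record)."""
    cpu_mhz: int
    mem_mhz: int
    speedup: float
    inefficiency: float


def find_performance_cluster(
    settings: List[Setting], budget: float, cluster_threshold: float
) -> Tuple[Optional[Setting], List[Setting]]:
    """Return (optimal setting, performance cluster) for one sample."""
    # Filter: keep only settings that fall within the inefficiency budget.
    feasible = [s for s in settings if s.inefficiency <= budget]
    if not feasible:
        return None, []

    # First pass: the optimal setting is the feasible one with the highest speedup.
    optimal = max(feasible, key=lambda s: s.speedup)

    # Second pass: collect every feasible setting whose speedup is within
    # the cluster threshold of the optimal speedup (assumed relative bound).
    floor = optimal.speedup * (1.0 - cluster_threshold)
    cluster = [s for s in feasible if s.speedup >= floor]
    return optimal, cluster


# Made-up measurements for a single sample, for illustration only.
EXAMPLE_SETTINGS = [
    Setting(1000, 800, 2.00, 1.40),  # fastest, but over a budget of 1.3
    Setting(900, 800, 1.90, 1.28),
    Setting(900, 400, 1.85, 1.20),
    Setting(600, 400, 1.50, 1.05),
]

if __name__ == "__main__":
    for threshold in (0.01, 0.05):
        best, cluster = find_performance_cluster(EXAMPLE_SETTINGS, 1.3, threshold)
        print(threshold, best, [(s.cpu_mhz, s.mem_mhz) for s in cluster])
```

With these made-up numbers, widening the threshold from 1\% to 5\% admits a second setting into the cluster, which is the effect the surrounding text attributes to larger cluster thresholds.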
@@ -95,7 +95,7 @@ compromising performance by setting low inefficiency budgets to save energy. @@ -95,7 +95,7 @@ compromising performance by setting low inefficiency budgets to save energy.
95 95
96 Figures~\ref{clusters-gobmk}(c),~\ref{clusters-gobmk}(d) plot the 96 Figures~\ref{clusters-gobmk}(c),~\ref{clusters-gobmk}(d) plot the
97 performance clusters for \textit{gobmk} for inefficiency budget of 1.3 and 97 performance clusters for \textit{gobmk} for inefficiency budget of 1.3 and
98 -cluster thresholds of 1\% and 5\% respectively. As we saw in 98 +cluster thresholds of 1\% and 5\% respectively. As we observed in
99 Figure~\ref{gobmk-optimal}, the optimal settings for \textit{gobmk} change 99 Figure~\ref{gobmk-optimal}, the optimal settings for \textit{gobmk} change
100 every sample (of length 10 million instructions) and follow 100 every sample (of length 10 million instructions) and follow
101 application phases (CPI). Figure~\ref{clusters-gobmk}(c) shows that by 101 application phases (CPI). Figure~\ref{clusters-gobmk}(c) shows that by
@@ -118,7 +118,8 @@ Figures~\ref{clusters-gobmk}(a),~\ref{clusters-gobmk}(c) plot the performance @@ -118,7 +118,8 @@ Figures~\ref{clusters-gobmk}(a),~\ref{clusters-gobmk}(c) plot the performance
118 clusters for \textit{gobmk} for two different inefficiency budgets of 1.0 and 1.3 for 118 clusters for \textit{gobmk} for two different inefficiency budgets of 1.0 and 1.3 for
119 cluster threshold of 1\%. 119 cluster threshold of 1\%.
120 %\XXXnote{reword next sentence? -Dave} 120 %\XXXnote{reword next sentence? -Dave}
121 -Not all of the stable regions increase in length with increasing inefficiency but instead depends on the workload. 121 +Not all of the stable regions increase in length with increasing inefficiency;
  122 +this trend varies with workloads.
122 %Increase in the length of stable regions with increase in 123 %Increase in the length of stable regions with increase in
123 %inefficiency is a 124 %inefficiency is a
124 %function of workload characteristics. 125 %function of workload characteristics.
@@ -344,8 +345,8 @@ runs at one setting, sample 8-9 runs at another setting and sample 10 runs at a @@ -344,8 +345,8 @@ runs at one setting, sample 8-9 runs at another setting and sample 10 runs at a
344 different setting due to the availability of more (and better) choices. 345 different setting due to the availability of more (and better) choices.
345 %\XXXnote{sounds wordy -Dave}. 346 %\XXXnote{sounds wordy -Dave}.
346 In our system, we observed only a small improvement in performance (\textless 347 In our system, we observed only a small improvement in performance (\textless
347 -1\%) with higher number of frequency steps when  
348 -tuning is free as optimal 348 +1\%) with an increased number of frequency steps when
  349 +tuning is free, as optimal
349 settings in both cases were off by only a few MHz. It is the balance between the 350 settings in both cases were off by only a few MHz. It is the balance between the
350 tuning overhead and the energy-performance savings that is 351 tuning overhead and the energy-performance savings that is
351 critical in deciding the correct size of the search space. 352 critical in deciding the correct size of the search space.
system_methodology.tex
@@ -2,7 +2,7 @@ @@ -2,7 +2,7 @@
2 \label{sec-sys-methodology} 2 \label{sec-sys-methodology}
3 Energy management algorithms must tune the underlying hardware components to 3 Energy management algorithms must tune the underlying hardware components to
4 keep the system within the given inefficiency budget. Hardware components 4 keep the system within the given inefficiency budget. Hardware components
5 -provide multiple knobs that can be tuned to trade-off performance for energy 5 +provide multiple ``knobs'' that can be tuned to trade-off performance for energy
6 savings. For example, the energy consumed by the CPU can be managed by tuning 6 savings. For example, the energy consumed by the CPU can be managed by tuning
7 its frequency and voltage. 7 its frequency and voltage.
8 %DRAM energy can be 8 %DRAM energy can be
@@ -121,11 +121,12 @@ being 1.25V. @@ -121,11 +121,12 @@ being 1.25V.
121 %0.02V/30MHz. The voltage and frequency pairs match with the frequency steps 121 %0.02V/30MHz. The voltage and frequency pairs match with the frequency steps
122 %used by the Nexus S. 122 %used by the Nexus S.
123 123
124 -For the memory system, we simulated a LPDDR3 single channel, one rank memory access using an open-page  
125 -policy. Timing and current parameters for LPDDR3 are configured as specified in  
126 -data sheets from Micron~\cite{micronspec-url}. Memory clock domain is configured with a  
127 -frequency range of 200MHz to 800MHz. As mentioned earlier, we did not scale memory  
128 -voltage. The power supplies---VDD and VDD2---for LPDDR3 are fixed at 1.8V and 1.2V respectively. 124 +For the memory system, we simulated a LPDDR3 single channel, one rank memory
  125 +using an open-page access policy. Timing and current parameters for LPDDR3 are
  126 +configured as specified in data sheets from Micron~\cite{micronspec-url}. Memory
  127 +clock domain is configured with a frequency range of 200MHz to 800MHz. As
  128 +mentioned earlier, we did not scale memory voltage. The power supplies---VDD and
  129 +VDD2---for LPDDR3 are fixed at 1.8V and 1.2V respectively.
129 130
130 We first simulated 12 integer and 9 floating point SPEC CPU2006 131 We first simulated 12 integer and 9 floating point SPEC CPU2006
131 benchmarks~\cite{henning2006spec}, with each benchmark either running to 132 benchmarks~\cite{henning2006spec}, with each benchmark either running to