Commit 0e5dc2a4c5e8d9723ccafbe755ff77c3269835e0

Authored by Rizwana Begum
1 parent 44d949a6

+ dave's comments

inefficiency.tex
... ... @@ -180,13 +180,13 @@ consumption.
180 180 %CPU2006 benchmarks with highest error of 10\% except for $gobmk (18\%)$ and $lbm
181 181 %(24\%)$.
182 182 %%%%% END OF MODEL %%%%%%
183   -In this work we demonstrate how to use inefficiency, deferring predicting and
  183 +In this work we demonstrate how to use inefficiency and defer both predicting and
184 184 optimizing $E_{min}$ to future work.
185 185  
186 186 \subsection{Managing Inefficiency}
187 187 %
188 188 Future energy management algorithms need to tune system settings to keep the
189   -system within specified inefficiency budget and deliver the best performance.
  189 +system within the specified inefficiency budget and deliver the best performance.
190 190 %
191 191 Techniques that use predictors such as instructions-per-cycle (IPC) to decide
192 192 when to use DVFS or migrate threads can be extended to operate under given
... ... @@ -202,7 +202,7 @@ under performance constraints, some have the potential to be modified to work
202 202 under energy constraints and thus could operate under
203 203 inefficiency budget~\cite{bitirgen2008coordinated,deng2012coscale,chen2011coordinating,fan2005synergy,felter2005performance,li2007cross,raghavendra2008no}.
204 204 %
205   -We leave building some of these algorithms into a system as future work.
  205 +We leave incorporating some of these algorithms into a system as future work.
206 206 %
207 207 In this paper, we characterize the optimal performance point under different
208 208 inefficiency constraints and illustrate that the stability of these points
... ...
inefficiency_speedup.tex
... ... @@ -27,7 +27,7 @@ Figure~\ref{heatmaps} plots the speedup and inefficiency for three workloads
27 27 operating with various CPU and memory frequencies. As the figure shows, the
28 28 ability of a workload to trade-off energy and performance using CPU and memory
29 29 frequency, depends on its mix of CPU and memory instructions. For CPU intensive
30   -workloads like \textit{bzip2}, speedup varies with only CPU frequency, and
  30 +workloads like \textit{bzip2}, speedup varies only with CPU frequency;
31 31 memory frequency has no impact on speedup. For workloads that have balanced CPU
32 32 and memory intensive phases like \textit{gobmk}, speedup varies with both CPU
33 33 and memory frequency. The \textit{milc} benchmark has some memory intensive
... ... @@ -44,7 +44,7 @@ We make three major observations:
44 44 efficiently.} At the lowest frequencies, 100MHz and 200MHz for CPU and
45 45 memory respectively, \textit{gobmk} takes the longest to execute. These settings slow down the application so much
46 46 that its overall energy consumption increases, thereby resulting in
47   -inefficiency of 1.55 for \textit{gobmk}. Algorithms that choose these frequency settings spend
  47 +inefficiency of 1.55. Algorithms that choose these frequency settings spend
48 48 55\% more energy without any performance improvement.
49 49 %The converse is also true
50 50 %as noted by our second observation.
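Note on the hunk above: the step from an inefficiency of 1.55 to "55\% more energy" can be read as a one-line calculation. This is a sketch that assumes inefficiency $I$ is the ratio of the energy consumed $E$ to the minimum energy $E_{min}$ the workload could have consumed; the definition itself is not shown in this hunk, and the symbol $I$ is introduced here only for illustration.

```latex
\[
I = \frac{E}{E_{min}} = 1.55
\quad\Longrightarrow\quad
\frac{E - E_{min}}{E_{min}} = I - 1 = 0.55 ,
\]
% i.e. the chosen setting consumes 55\% more energy than the minimum.
```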
... ...
introduction.tex
... ... @@ -19,10 +19,10 @@ Still other hardware energy-performance tradeoffs are on the horizon, arising
19 19 from capabilities such as memory frequency scaling~\cite{david2011memory} and nanosecond-speed DVFS
20 20 emerging in next-generation hardware designs~\cite{6084810}.
21 21  
22   -We envision a next-generation smartphone capable of scaling both voltage and
23   -frequency of CPU and only frequency of memory.
  22 +We envision a next-generation smartphone capable of CPU DVFS (Dynamic Voltage
  23 +and Frequency Scaling) and memory DFS (Dynamic Frequency Scaling).
24 24 %
25   -While the addition of memory DVFS can be used to improve energy-constrained
  25 +While the addition of memory DFS can be used to improve energy-constrained
26 26 performance, the larger frequency state space compared to CPU DVFS alone also
27 27 provides more incorrect settings that waste energy or degrade performance.
28 28 %
... ... @@ -33,7 +33,7 @@ energy constraints.
33 33 Our work represents two advances over previous efforts.
34 34 %
35 35 First, while previous works have explored energy minimizations using DVFS
36   -under performance constraints focusing on reducing slack~\cite{deng2012coscale}, we are the first to
  36 +under performance constraints focusing on reducing slack, we are the first to
37 37 study the potential DVFS settings under an energy constraint.
38 38 %
39 39 Specifying performance constraints for servers is appropriate, since they are
... ... @@ -75,7 +75,7 @@ performance.
75 75 %
76 76 \item We study the energy-performance trade-offs of systems that are capable
77 77 of both CPU and memory DVFS for multiple applications. We show that poor
78   -frequency selection can both hurt performance and energy consumption.
  78 +frequency selection can hurt both performance and energy consumption.
79 79 %
80 80 \item We characterize the optimal frequency settings for multiple
81 81 applications and inefficiency budgets. We introduce \textit{performance
... ... @@ -87,10 +87,10 @@ management algorithms.
87 87 %
88 88 \end{enumerate}
89 89  
90   -We use the \texttt{Gem5} simulator, the Android smartphone platform and Linux
  90 +We use the \texttt{gem5} simulator, the Android smartphone platform and Linux
91 91 kernel, and an empirical power model to (1) measure the inefficiency of
92 92 several applications for a wide range of frequency settings, (2) compute
93   -performance clusters, and (3) study how they evolve.
  93 +performance clusters, and (3) study how performance clusters evolve.
94 94 %
95 95 We are currently constructing a complete system to study tuning algorithms
96 96 that can build on our insights to adaptively choose frequency settings at
... ...
optimal_performance.tex
... ... @@ -107,7 +107,7 @@ highest performance). For example, \textit{bzip2} is CPU bound and therefore
107 107 its performance at memory frequency of 200MHz is within 3\% of performance at a
108 108 memory frequency of 800MHz while the CPU is running at 1000MHz. By sacrificing that
109 109 3\% of performance, the system could have consumed 1/4 the memory background
110   -energy saving 2.7\% of the system energy and staying well under the given inefficiency budget.
  110 +energy, saving 2.7\% of the system energy and staying well under the given inefficiency budget.
111 111 %\end{enumerate}
112 112  
113 113 We believe that, if the user is willing to sacrifice some performance under
... ...
performance_clusters.tex
... ... @@ -48,7 +48,7 @@ the system.
48 48  
49 49 \subsection{Performance Clusters}
50 50 We search for the performance clusters using an algorithm that is similar to the approach we used to find the optimal settings. We
51   -first filter the settings that fall within a given inefficiency budget, and
  51 +first filter the settings that fall within a given inefficiency budget and
52 52 then search for the optimal settings in the first pass. In the second pass, we find all of the
53 53 settings that have a speedup within the specified \textit{cluster threshold} of the optimal performance.
54 54  
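Note on the hunk above: the two-pass search described in this paragraph can be sketched as follows. This is a minimal illustration, not the authors' implementation. The `Setting` record, the function name `performance_cluster`, and the reading of "within the cluster threshold" as a relative fraction of the optimal speedup are all assumptions made for the example.

```python
from typing import List, NamedTuple, Optional, Tuple


class Setting(NamedTuple):
    """One (CPU, memory) frequency pair with its measured metrics (hypothetical record)."""
    cpu_mhz: int
    mem_mhz: int
    speedup: float
    inefficiency: float


def performance_cluster(
    settings: List[Setting],
    budget: float,
    cluster_threshold: float,
) -> Tuple[Optional[Setting], List[Setting]]:
    """Return the optimal setting under the budget and its performance cluster.

    cluster_threshold is assumed to be a fraction (0.05 for 5%) of the
    optimal speedup.
    """
    # Filter: keep only settings that fall within the inefficiency budget.
    feasible = [s for s in settings if s.inefficiency <= budget]
    if not feasible:
        return None, []

    # First pass: the optimal setting is the feasible one with the highest speedup.
    # Ties are broken by input order here, which is an arbitrary choice.
    best = max(feasible, key=lambda s: s.speedup)

    # Second pass: collect every feasible setting whose speedup is within
    # the cluster threshold of the optimal speedup.
    floor = best.speedup * (1.0 - cluster_threshold)
    cluster = [s for s in feasible if s.speedup >= floor]
    return best, cluster
```

Calling `performance_cluster(settings, budget=1.3, cluster_threshold=0.05)` on a list of measured settings returns the best setting within a budget of 1.3 together with all settings within 5\% of its speedup. Raising the threshold can only grow the cluster, which matches the intuition that looser thresholds give a tuning algorithm more equivalent choices.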
... ... @@ -95,7 +95,7 @@ compromising performance by setting low inefficiency budgets to save energy.
95 95  
96 96 Figures~\ref{clusters-gobmk}(c),~\ref{clusters-gobmk}(d) plot the
97 97 performance clusters for \textit{gobmk} for inefficiency budget of 1.3 and
98   -cluster thresholds of 1\% and 5\% respectively. As we saw in
  98 +cluster thresholds of 1\% and 5\% respectively. As we observed in
99 99 Figure~\ref{gobmk-optimal}, the optimal settings for \textit{gobmk} change
100 100 every sample (of length 10 million instructions) and follows
101 101 application phases (CPI). Figure~\ref{clusters-gobmk}(c) shows that by
... ... @@ -118,7 +118,8 @@ Figures~\ref{clusters-gobmk}(a),~\ref{clusters-gobmk}(c) plot the performance
118 118 clusters for \textit{gobmk} for two different inefficiency budgets of 1.0 and 1.3 for
119 119 cluster threshold of 1\%.
120 120 %\XXXnote{reword next sentence? -Dave}
121   -Not all of the stable regions increase in length with increasing inefficiency but instead depends on the workload.
  121 +Not all of the stable regions increase in length with increasing inefficiency;
  122 +this trend varies with workloads.
122 123 %Increase in the length of stable regions with increase in
123 124 %inefficiency is a
124 125 %function of workload characteristics.
... ... @@ -344,8 +345,8 @@ runs at one setting, sample 8-9 runs at another setting and sample 10 runs at a
344 345 different setting due to the availability of more (and better) choices.
345 346 %\XXXnote{sounds wordy -Dave}.
346 347 In our system, we observed only a small improvement in performance (\textless
347   -1\%) with higher number of frequency steps when
348   -tuning is free as optimal
  348 +1\%) with an increased number of frequency steps when
  349 +tuning is free, as optimal
349 350 settings in both cases were off by only a few MHz. It is the balance between the
350 351 tuning overhead and the energy-performance savings that is
351 352 critical in deciding the correct size of the search space.
... ...
system_methodology.tex
... ... @@ -2,7 +2,7 @@
2 2 \label{sec-sys-methodology}
3 3 Energy management algorithms must tune the underlying hardware components to
4 4 keep the system within the given inefficiency budget. Hardware components
5   -provide multiple knobs that can be tuned to trade-off performance for energy
  5 +provide multiple "knobs" that can be tuned to trade-off performance for energy
6 6 savings. For example, the energy consumed by the CPU can be managed by tuning
7 7 its frequency and voltage.
8 8 %DRAM energy can be
... ... @@ -121,11 +121,12 @@ being 1.25V.
121 121 %0.02V/30MHz. The voltage and frequency pairs match with the frequency steps
122 122 %used by the Nexus S.
123 123  
124   -For the memory system, we simulated a LPDDR3 single channel, one rank memory access using an open-page
125   -policy. Timing and current parameters for LPDDR3 are configured as specified in
126   -data sheets from Micron~\cite{micronspec-url}. Memory clock domain is configured with a
127   -frequency range of 200MHz to 800MHz. As mentioned earlier, we did not scale memory
128   -voltage. The power supplies---VDD and VDD2---for LPDDR3 are fixed at 1.8V and 1.2V respectively.
  124 +For the memory system, we simulated a LPDDR3 single channel, one rank memory
  125 +using an open-page access policy. Timing and current parameters for LPDDR3 are
  126 +configured as specified in data sheets from Micron~\cite{micronspec-url}. Memory
  127 +clock domain is configured with a frequency range of 200MHz to 800MHz. As
  128 +mentioned earlier, we did not scale memory voltage. The power supplies---VDD and
  129 +VDD2---for LPDDR3 are fixed at 1.8V and 1.2V respectively.
129 130  
130 131 We first simulated 12 integer and 9 floating point SPEC CPU2006
131 132 benchmarks~\cite{henning2006spec}, with each benchmark either running to
... ...