Commit a160460b65da45fd7effab7c6cd7ed2e1c7085b1

Authored by Rizwana Begum
1 parent 3b0c7aa8

Incorporated Mark's comments

abstract.tex
1 1 \begin{abstract}
2 2  
3 3 Battery lifetime continues to be a top complaint about smartphones. Dynamic
4   -voltage and frequency scaling (DVFS) has existed for mobile device CPUs for
5   -some time, and can be used to dynamically trade off energy for performance.
6   -To make more energy-performance tradeoffs possible, DVFS is beginning to be
7   -applied to memory as well.
  4 +voltage and frequency scaling (DVFS) has existed for mobile device CPUs for some
  5 +time, and provides a tradeoff between energy and performance. DVFS is beginning
  6 +to be applied to memory as well to make more energy-performance tradeoffs
  7 +possible.
8 8  
9 9 We present the first characterization of the behavior and optimal frequency
10 10 settings of workloads running both under \textit{energy constraints} and on
... ...
acknowledgement.tex 0 → 100644
  1 +\section{Acknowledgments}
  2 +This material is based on work partially supported by NSF Collaborative Awards
  3 +CSR-1409014 and CSR-1409367. Any opinions, findings, and conclusions or
  4 +recommendations expressed in this material are those of the authors and do not
  5 +necessarily reflect the views of the National Science Foundation.
... ...
inefficiency.tex
... ... @@ -7,24 +7,24 @@ management algorithms for mobile systems should optimize performance under
7 7 \textit{energy constraints}.
8 8 %
9 9 While several researchers have proposed algorithms that work under energy
10   -constraints~\cite{mobiheld09-cinder,ecosystem}, these approaches require that
11   -the constraints are expressed in terms of absolute energy.
  10 +constraints, these approaches require that the constraints are expressed in
  11 +terms of absolute energy~\cite{mobiheld09-cinder,ecosystem}.
12 12 %
13   -For example, rate-limiting approaches~\cite{mobiheld09-cinder} take the
14   -maximum energy that can be consumed in a given time period as an input.
  13 +For example, rate-limiting approaches take the maximum energy that can be
  14 +consumed in a given time period as an input~\cite{mobiheld09-cinder}.
15 15 %
16 16 Once the application consumes its limit, it is paused until the next time
17 17 period begins.
18 18  
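The rate-limiting behavior just described can be sketched in a few lines; `rate_limited_schedule` and its parameters are our hypothetical illustration, not the interface of the cited systems:

```python
def rate_limited_schedule(step_energies, budget_j, period_s, step_time_s):
    """Hypothetical sketch of rate limiting: each period grants `budget_j`
    joules; once the budget is spent, execution pauses until the next
    period begins. Returns total elapsed time in seconds."""
    elapsed, spent = 0.0, 0.0
    for e in step_energies:
        if spent + e > budget_j:
            # Budget exhausted: wait for the start of the next period.
            elapsed = (elapsed // period_s + 1) * period_s
            spent = 0.0
        spent += e
        elapsed += step_time_s
    return elapsed
```

For example, five 1J steps of 1s each under a 2J budget per 10s period finish at t = 21s rather than t = 5s, showing how pausing degrades performance without regard to the work's inherent energy needs.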
19   -Unfortunately, in practice it is difficult to choose absolute energy
  19 +Unfortunately, in practice, it is difficult to choose absolute energy
20 20 constraints appropriately for a diverse group of applications without
21 21 understanding their inherent energy needs.
22 22 %
23 23 Energy consumption varies across applications, devices, and operating
24 24 conditions, making it impractical to choose an absolute energy budget.
25 25 %
26   -Also, absolute energy constraints may slow down applications to the point
27   -that total energy consumption \textit{increases} at the same time that
  26 +Also, applying absolute energy constraints may slow down applications to the
  27 +point that total energy consumption \textit{increases} and
28 28 performance is degraded.
29 29  
30 30 Other metrics that incorporate energy take the form of $Energy * Delay^n$.
... ... @@ -56,7 +56,7 @@ energy the application could have consumed ($E_{min}$) on the same device as
56 56 inefficiency: $I = \frac{E}{E_{min}}$.
57 57 %
58 58 An \textit{inefficiency} of $1$ represents an application's most efficient
59   -execution, while $1.5$ indicate the the application consumed $50\%$ more
  59 +execution, while $1.5$ indicates that the application consumed $50\%$ more
60 60 energy than its most efficient execution.
61 61 %
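The metric reduces to a single ratio; a minimal sketch (the function name is ours):

```python
def inefficiency(energy_j, min_energy_j):
    """Inefficiency I = E / E_min: 1.0 is the most efficient possible
    execution; 1.5 means the run consumed 50% more energy than its most
    efficient execution."""
    return energy_j / min_energy_j
```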
62 62 Inefficiency is independent of workloads and devices and avoids the problems
... ... @@ -86,7 +86,7 @@ We continue by addressing these questions.
86 86 % performance.
87 87 Devices will operate between an inefficiency of 1 and $I_{max}$, which
88 88 represents the unbounded energy constraint allowing the application to
89   -consume unbounded energy to deliver the best performance.
  89 +consume as much energy as necessary to deliver the best performance.
90 90 %
91 91 $I_{max}$ depends upon applications and devices.
92 92 %
... ... @@ -165,7 +165,7 @@ of instructions.
165 165 %We envision a system capable of scaling voltage and frequency of CPU and only
166 166 %frequency of DRAM.
167 167 Our models consider cross-component interactions on performance and energy.
168   -Performance model uses hardware performance counters to measure amount of time
  168 +The performance model uses hardware performance counters to measure amount of time
169 169 each component is $Busy$ completing the work, $Idle$ stalled on the other
170 170 component and $Waiting$ for more work. We designed systematic methodology to
171 171 scale these states to estimate execution time of a given workload at different
... ... @@ -198,8 +198,8 @@ system~\cite{david2011memory,deng2012multiscale,deng2011memscale,diniz2007limiti
198 198 %
199 199 While most of the existing multi-component energy management approaches work
200 200 under performance constraints, some have potential to be modified to work
201   -under energy constraints and thus
202   -inefficiency~\cite{bitirgen2008coordinated,deng2012coscale,chen2011coordinating,fan2005synergy,felter2005performance,li2007cross,raghavendra2008no}.
  201 +under energy constraints and thus could operate under an
  202 +inefficiency budget~\cite{bitirgen2008coordinated,deng2012coscale,chen2011coordinating,fan2005synergy,felter2005performance,li2007cross,raghavendra2008no}.
203 203 %
204 204 We leave building some of these algorithms into a system as future work.
205 205 %
... ...
inefficiency_speedup.tex
... ... @@ -5,7 +5,7 @@ Scaling individual components---CPU and memory---using DVFS has been studied in
5 5 the past
6 6 %and
7 7 %researchers have used it
8   -to make power performance trade-offs. To the best of our knowledge, the prior
  8 +to make power-performance trade-offs. To the best of our knowledge, prior
9 9 work has not studied the system level energy-performance trade-offs of combined
10 10 CPU and memory DVFS.
11 11 %considering the interaction between CPU and memory
... ... @@ -43,8 +43,8 @@ We make three major observations:
43 43 \noindent \textit{Running slower doesn't mean that the system is running
44 44 efficiently.} At the lowest frequencies, 100MHz and 200MHz for CPU and
45 45 memory respectively, \textit{gobmk} takes the longest to execute. These settings slow down the application so much
46   -that its overall energy consumption increases, thereby resulting in 1.55
47   -inefficiency for \textit{gobmk}. Algorithms that choose these frequency settings spend
  46 +that its overall energy consumption increases, thereby resulting in an
  47 +inefficiency of 1.55 for \textit{gobmk}. Algorithms that choose these frequency settings spend
48 48 55\% more energy without any performance improvement.
49 49 %The converse is also true
50 50 %as noted by our second observation.
... ... @@ -70,9 +70,9 @@ a) use no more than given inefficiency budget b) should use only as much
70 70 inefficiency budget as needed, and c) deliver the best performance.
71 71 %\end{enumerate}
72 72  
73   -Consequently, like other constraints used by algorithms such as performance, power and absolute energy, $inefficiency$
  73 +Consequently, like other constraints used by algorithms such as performance, power and absolute energy, inefficiency
74 74 also allows energy management algorithms to waste system energy. We suggest
75   -that, even though $inefficiency$ doesn't completely eliminate the problem of
  75 +that, even though inefficiency doesn't completely eliminate the problem of
76 76 wasting energy, it mitigates the problem. For example, rate limiting approaches
77 77 waste energy because the energy budget is specified for a given time interval
78 78 and doesn't require a specific amount of work to be done within that budget.
... ...
optimal_performance.tex
... ... @@ -58,7 +58,7 @@ possible frequency settings under given inefficiency budget. It then finds the
58 58 CPU and memory frequency settings that result in the highest speedup. In cases
59 59 where multiple settings result in similar speedup (within 0.5\%), to filter out
60 60 simulation noise, the algorithm selects the settings with the highest CPU (first)
61   -and memory frequency as this setting is bound to have highest performance among
  61 +and then memory frequency, as this setting is bound to have the highest performance among
62 62 the other possibilities.
63 63  
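The selection just described can be sketched as follows; the data layout (`settings` as a map from frequency pairs to measured speedup and inefficiency) is our assumption, not the paper's implementation:

```python
def pick_optimal(settings, budget, tol=0.005):
    """Hypothetical sketch: among settings under the inefficiency budget,
    keep those within `tol` (0.5%) of the best speedup, then prefer the
    highest CPU frequency and, as a tie-break, the highest memory frequency.

    `settings` maps (cpu_mhz, mem_mhz) -> (speedup, inefficiency)."""
    feasible = {k: v for k, v in settings.items() if v[1] <= budget}
    best = max(speedup for speedup, _ in feasible.values())
    near = [k for k, (speedup, _) in feasible.items()
            if speedup >= best * (1 - tol)]
    # Tuples compare CPU frequency first, then memory frequency.
    return max(near)
```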
64 64 Figure~\ref{gobmk-optimal} plots the optimal settings for \textit{gobmk} for all
... ... @@ -103,7 +103,7 @@ optimal settings for every sample may hinder some energy-performance trade-off
103 103 that could have been made if performance was not so tightly bounded (to only
104 104 highest performance). For example, \textit{bzip2} is CPU bound and therefore
105 105 its performance at memory frequency of 200MHz is within 3\% of performance at a
106   -memory frequency of 800MHz while CPU is running at 1000MHz. By sacrificing that
  106 +memory frequency of 800MHz while the CPU is running at 1000MHz. By sacrificing that
107 107 3\% of performance, the system could have consumed 1/4 the memory background
108 108 energy staying well under the given inefficiency budget.
109 109 %\end{enumerate}
... ...
paper.tex
... ... @@ -81,6 +81,7 @@ Geoffrey Challen, Mark Hempstead}
81 81 \input{algorithm_implications.tex}
82 82 % 20 Apr 2015 : GWA : Add things here as needed.
83 83 \input{conclusions.tex}
  84 +\input{acknowledgement.tex}
84 85  
85 86 % 23 Sep 2014 : GWA : TODO : Reenable before submission.
86 87  
... ...
performance_clusters.tex
... ... @@ -122,8 +122,8 @@ Not all of the stable regions increase in length with increasing inefficiency bu
122 122 %inefficiency is a
123 123 %function of workload characteristics.
124 124 If consecutive
125   -samples of a workload have a small difference in performance but differ significantly in energy
126   -consumption then only at
  125 +samples of a workload have a small difference in performance, but differ significantly in energy
  126 +consumption, then only at
127 127 higher inefficiency budgets will the system find common settings for these
128 128 consecutive samples. % because all settings under an inefficiency budget are considered.
129 129 %Note that we find the performance clusters by considering
... ... @@ -144,7 +144,7 @@ Figure~\ref{clusters-milc} shows that \textit{milc} has similar trends as
144 144  
145 145 An interesting observation from the performance clusters is that algorithms
146 146 like CoScale~\cite{deng2012coscale} that search for the best performing settings every interval starting
147   -from the maximum frequency settings are not optimal. Algorithms can reduce the
  147 +from the maximum frequency settings are not efficient. Algorithms can reduce the
148 148 overhead of the optimal settings search by starting the search from the settings selected
149 149 for the previous interval as application phases are often stable for multiple sample intervals.
150 150 %as the application phases don't change drastically in
... ... @@ -168,7 +168,7 @@ settings between the current sample performance cluster and the available
168 168 settings until the previous sample. When the algorithm finds no more common
169 169 samples, it marks the end of the stable region. If more than one frequency pair
170 170 exists in the available settings for this region, the algorithm chooses the
171   -setting with highest CPU (first) and memory frequency as optimal settings for this
  171 +setting with the highest CPU (first) and then memory frequency as the optimal settings for this
172 172 region. Figure~\ref{lbm-stable-line-5-annotated} shows the CPU and memory frequency
173 173 settings selected for stable regions of benchmark \textit{lbm}. It also has
174 174 markers indicating the end of each stable region. In this figure, note that for
... ... @@ -176,11 +176,11 @@ every stable region (between any two markers) the frequency of both CPU and memo
176 176 constant.
177 177  
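A minimal sketch of the stable-region construction just described (function and data names are ours; each sample's settings are the frequency pairs in its performance cluster):

```python
def stable_regions(per_sample_settings):
    """Hypothetical sketch: intersect each sample's feasible frequency
    pairs with the running set; when the intersection becomes empty, close
    the region, choosing the pair with the highest CPU (then memory)
    frequency, and start a new region at the current sample.

    Returns (start, end_exclusive, (cpu_mhz, mem_mhz)) triples."""
    regions, start, avail = [], 0, None
    for i, sample in enumerate(per_sample_settings):
        cand = set(sample) if avail is None else avail & set(sample)
        if not cand:
            # No common setting left: the stable region ends here.
            regions.append((start, i, max(avail)))
            start, cand = i, set(sample)
        avail = cand
    regions.append((start, len(per_sample_settings), max(avail)))
    return regions
```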
178 178 %Note that
179   -Our algorithm is not practical for real systems, it knows the characteristics of the
  179 +Our algorithm is not practical for real systems, as it knows the characteristics of the
180 180 future samples and their performance clusters at the beginning of a stable
181 181 region. % (and therefore is impractical to implement in real systems).
182 182 We are
183   -currently designing algorithms that are capable of tuning the system while
  183 +currently designing algorithms in hardware and software that are capable of tuning the system while
184 184 running the application. In Section~\ref{sec-algo-implications}, we
185 185 propose ways in which length of stable regions and the available settings for a
186 186 given region can be predicted for energy management algorithms in real systems.
... ... @@ -260,7 +260,7 @@ across benchmarks for multiple cluster thresholds at inefficiency budget of 1.3.
260 260  
261 261 \subsection{Energy-Performance Trade-offs}
262 262 In this subsection we analyze the energy-performance trade-offs made by our
263   -ideal algorithm. We then add tuning cost of our algorithm and compare the
  263 +ideal algorithm. We then add the tuning cost of our algorithm and compare the
264 264 energy-performance trade-offs across multiple applications. We study multiple
265 265 cluster thresholds and an inefficiency budget of 1.3.
266 266  
... ... @@ -300,9 +300,7 @@ frequency transitions. We assume tuning overhead
300 300 of 500us and 30uJ, which includes computing inefficiencies, searching for the
301 301 optimal setting, and transitioning the hardware to new
302 302 settings~\cite{deng2012coscale}. We assumed that a space of 100 settings is
303   -searched for every transition. \textit{gobmk} is the only benchmark that shows a
304   -performance improvement from the optimal settings when performance is allowed to
305   -degrade, which is unexpected. We are investigating its root cause.
  303 +searched for every transition.
306 304 %This is not intuitive and we are investigating the cause of this anomaly
307 305 %\XXXnote{MH: be careful I would cut this s%entance at a minimum and then find
308 306 %the reason for the change}.
... ... @@ -317,8 +315,8 @@ samples. This results in longer stable regions.
317 315 stable region. The longer the stable regions, the lower
318 316 the number of transitions that the system needs to make.
319 317 \item Allowing a higher degradation in performance may, in fact, result in improved
320   -performance when tuning overhead of algorithms is included due to reduction in
321   -number of frequency transitions in the system. Consequently energy savings also
  318 +performance when tuning overhead is included, due to a reduction in the
  319 +number of frequency transitions in the system; consequently, energy savings also
322 320 increase.
323 321 \end{enumerate}
324 322  
... ...
system_methodology.tex
... ... @@ -12,7 +12,8 @@ Recent
12 12 research~\cite{david2011memory,deng2011memscale} has shown that DRAM frequency scaling
13 13 also provides performance and energy trade-offs.
14 14  
15   -In this work, we scale frequency and voltage for the CPU and scale only frequency for memory.
  15 +In this work, we scale frequency and voltage for the CPU and scale only
  16 +frequency for the memory~\cite{david2011memory,deng2011memscale}.
16 17 %In this work, we scale frequency and voltage for the CPU and for the memory, scale frequency only.
17 18 %to make energy-performance trade-offs.
18 19 %Dynamic Voltage and
... ... @@ -42,7 +43,7 @@ to perform our studies.
42 43 Current Gem5 versions provide the infrastructure necessary to change CPU
43 44 frequency and voltage; we extended Gem5 DVFS to incorporate memory frequency
44 45 scaling. As shown in Figure~\ref{fig-system-block-diag}, Gem5 provides a DVFS
45   -controller device that provides interface to control frequency by the OS at
  46 +controller device that provides an interface for the OS to control frequency at
46 47 runtime. We developed a memory frequency governor similar to existing Linux CPU
47 48 frequency governors.
48 49 %that are capable of tuning memory frequency at runtime.
... ... @@ -76,14 +77,14 @@ We developed energy models for the CPU and DRAM for our studies. Gem5 comes
76 77 with the energy models for various DRAM chipsets. The
77 78 DRAMPower~\cite{drampower-tool} model is integrated into Gem5 and computes the
78 79 memory energy consumption periodically during the benchmark execution. However,
79   -Gem5 lacks a model for CPU energy consumption. We developed a processor power
  80 +Gem5 lacks a model for CPU energy consumption. We developed a processor power
80 81 model based on empirical measurements of a PandaBoard~\cite{pandaboard-url}
81 82 evaluation board. The board includes an OMAP4430~chipset with a Cortex~A9
82 83 processor; this chipset is used in the mobile platform we want to emulate, the
83 84 Samsung Nexus S. We ran microbenchmarks designed to stress the PandaBoard to
84 85 its full utilization and measured power consumed using an Agilent~34411A
85 86 multimeter. Because of the limitations of the platform, we could only measure
86   -peak dynamic power. Therefore to model different voltage levels we scaled it
  87 +peak dynamic power. Therefore, to model different voltage levels we scaled it
87 88 quadratically with voltage and linearly with frequency $(P{\propto}V^{2}f)$. Our
88 89 peak dynamic power agrees with the numbers reported by previous
89 90 work~\cite{poweragile-hotos11} and the datasheets.
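The $P{\propto}V^{2}f$ scaling of the measured peak can be written directly; `scaled_power` and its reference-point arguments are an illustrative framing of the model, assuming scaling relative to the measured peak operating point:

```python
def scaled_power(peak_power_w, v, f, v_peak, f_peak):
    """Scale measured peak dynamic power quadratically with voltage and
    linearly with frequency: P = P_peak * (V / V_peak)**2 * (f / f_peak)."""
    return peak_power_w * (v / v_peak) ** 2 * (f / f_peak)
```

For example, halving frequency and dropping from 1.25V to 1.0V scales a 2W peak to roughly 0.64W.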
... ... @@ -105,22 +106,22 @@ proportional to supply voltage~\cite{leakage-islped02}.
105 106 %with our CPU power model to compute CPU energy consumption of the application at run time.
106 107  
107 108 \subsection{Experimental Methodology}
108   -Our simulation infrastructure is based on Android~4.1.1 ``Jelly Bean'' run
109   -on the Gem5 full system simulator. We model a Cortex-A9 processor, single core,
110   -out-of-order CPU with an issue width of 8, L1 cache size of 64~KB with access
111   -latency of 2 core cycles and a unified L2 cache of size 2~MB with hit latency of
112   -12 core cycles. The CPU and caches operate under the same clock domain. For our
113   -purposes, we have configured the CPU clock domain frequency to have a range of
114   -100--1000~MHZ with highest voltage being 1.25V.
  109 +Our simulation infrastructure is based on Android~4.1.1 ``Jelly Bean'' run on
  110 +the Gem5 full system simulator. We use the default core configuration provided
  111 +by Gem5 revision 10585, which is designed to reflect an ARM Cortex-A15 processor
  112 +with a 64~KB L1 cache with an access latency of 2 core cycles and a unified
  113 +2~MB L2 cache with a hit latency of 12 core cycles. The CPU and caches
  114 +operate under the same clock domain. For our purposes, we have configured the
  115 +CPU clock domain frequency to have a range of 100--1000~MHz with the highest
  116 +voltage being 1.25V.
115 117 % MH: This might confuse readers
116   -%Our
117   -%experiments with a simple ring oscillator show that voltage changes by
118   -%0.02V/30MHz. The voltage and frequency pairs match with the frequency steps used
119   -%by the Nexus S.
  118 +%Our experiments with a simple ring oscillator show that voltage changes by
  119 +%0.02V/30MHz. The voltage and frequency pairs match with the frequency steps
  120 +%used by the Nexus S.
120 121  
121 122 For the memory system, we simulated an LPDDR3 single-channel, one-rank memory using an open-page
122 123 policy. Timing and current parameters for LPDDR3 are configured as specified in
123   -Micron data sheet~\cite{micronspec-url}. Memory clock domain is configured with a
  124 +data sheets from Micron~\cite{micronspec-url}. The memory clock domain is configured with a
124 125 frequency range of 200MHz to 800MHz. As mentioned earlier, we did not scale memory
125 126 voltage. The power supplies---VDD and VDD2---for LPDDR3 are fixed at 1.8V and 1.2V respectively.
126 127  
... ... @@ -137,10 +138,10 @@ benchmarks that have interesting and unique phases.
137 138 %selected benchmarks that have interesting and unique phases with finer
138 139 %frequency step granularity of 30MHz for CPU and 40MHz for memory, a total of
139 140 %496 settings.
140   -Due to limited resources and time, running simulations for all benchmarks with
141   -finer frequency steps was difficult as it would have resulted in more than
142   -10,000 simulations, where each simulation would take anywhere between 4 to 12
143   -hours.
  141 +%Due to limited resources and time, running simulations for all benchmarks with
  142 +%finer frequency steps was difficult as it would have resulted in more than
  143 +%10,000 simulations, where each simulation would take anywhere between 4 to 12
  144 +%hours.
144 145  
145 146 We collected samples of a fixed amount of work so that each sample would
146 147 represent the same work even across different frequencies. In Gem5, we collected
... ... @@ -160,10 +161,10 @@ performance or energy.
160 161 Although individual energy-performance trade-offs of DVFS for CPU and
161 162 DFS for memory have been studied in the past, the trade-off resulting from
162 163 the cross-component interaction of these two components has not been
163   -characterized. CoScale~\cite{deng2012coscale} did point out that
  164 +characterized. CoScale~\cite{deng2012coscale} pointed out that the
164 165 interplay of performance and energy consumption of these two
165 166 components is complex and presented a heuristic that attempts to
166   -pick the optimal point. However, it did not measure and characterize
  167 +pick the optimal point. In the next section, we measure and characterize
167 168 the larger space of all system level performance and energy trade-offs
168 169 of various CPU and memory frequency settings.
169 170 %In the next section, we study how performance and
... ...