Commit a160460b65da45fd7effab7c6cd7ed2e1c7085b1
1 parent 3b0c7aa8
Incorporated Mark's comments
Showing 8 changed files with 62 additions and 57 deletions
abstract.tex
| 1 | 1 | \begin{abstract} |
| 2 | 2 | |
| 3 | 3 | Battery lifetime continues to be a top complaint about smartphones. Dynamic |
| 4 | -voltage and frequency scaling (DVFS) has existed for mobile device CPUs for | |
| 5 | -some time, and can be used to dynamically trade off energy for performance. | |
| 6 | -To make more energy-performance tradeoffs possible, DVFS is beginning to be | |
| 7 | -applied to memory as well. | |
| 4 | +voltage and frequency scaling (DVFS) has existed for mobile device CPUs for some | |
| 5 | +time, and provides a tradeoff between energy and performance. DVFS is beginning | |
| 6 | +to be applied to memory as well to make more energy-performance tradeoffs | |
| 7 | +possible. | |
| 8 | 8 | |
| 9 | 9 | We present the first characterization of the behavior and optimal frequency |
| 10 | 10 | settings of workloads running both under \textit{energy constraints} and on | ... | ... |
acknowledgement.tex
0 → 100644
| 1 | +\section{Acknowledgement} | |
| 2 | +This material is based on work partially supported by NSF Collaborative Awards | |
| 3 | +CSR-1409014 and CSR-1409367. Any opinions, findings, and conclusions or | |
| 4 | +recommendations expressed in this material are those of the authors and do not | |
| 5 | +necessarily reflect the views of the National Science Foundation. | ... | ... |
inefficiency.tex
| ... | ... | @@ -7,24 +7,24 @@ management algorithms for mobile systems should optimize performance under |
| 7 | 7 | \textit{energy constraints}. |
| 8 | 8 | % |
| 9 | 9 | While several researchers have proposed algorithms that work under energy |
| 10 | -constraints~\cite{mobiheld09-cinder,ecosystem}, these approaches require that | |
| 11 | -the constraints are expressed in terms of absolute energy. | |
| 10 | +constraints, these approaches require that the constraints are expressed in | |
| 11 | +terms of absolute energy~\cite{mobiheld09-cinder,ecosystem}. | |
| 12 | 12 | % |
| 13 | -For example, rate-limiting approaches~\cite{mobiheld09-cinder} take the | |
| 14 | -maximum energy that can be consumed in a given time period as an input. | |
| 13 | +For example, rate-limiting approaches take the maximum energy that can be | |
| 14 | +consumed in a given time period as an input~\cite{mobiheld09-cinder}. | |
| 15 | 15 | % |
| 16 | 16 | Once the application consumes its limit, it is paused until the next time |
| 17 | 17 | period begins. |
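To make the rate-limiting scheme concrete, here is a minimal sketch of a per-period absolute-energy budget; the class name `EnergyRateLimiter` and the `read_energy_counter` callback are illustrative, not the cited systems' actual interfaces:

```python
import time

class EnergyRateLimiter:
    """Minimal sketch of a per-period absolute-energy budget (in the spirit
    of the cited rate-limiting approaches, not their actual code)."""

    def __init__(self, budget_joules, period_seconds, read_energy_counter):
        self.budget = budget_joules              # absolute energy allowed per period
        self.period = period_seconds
        self.read_energy = read_energy_counter   # platform-specific energy meter (assumed)
        self._reset()

    def _reset(self):
        self.period_start = time.monotonic()
        self.energy_at_start = self.read_energy()

    def charge(self):
        """Called periodically from the managed application's main loop."""
        consumed = self.read_energy() - self.energy_at_start
        if consumed >= self.budget:
            # Budget exhausted: pause the application until the next period begins.
            remaining = self.period - (time.monotonic() - self.period_start)
            if remaining > 0:
                time.sleep(remaining)
            self._reset()
        elif time.monotonic() - self.period_start >= self.period:
            self._reset()
```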
| 18 | 18 | |
| 19 | -Unfortunately, in practice it is difficult to choose absolute energy | |
| 19 | +Unfortunately, in practice, it is difficult to choose absolute energy | |
| 20 | 20 | constraints appropriately for a diverse group of applications without |
| 21 | 21 | understanding their inherent energy needs. |
| 22 | 22 | % |
| 23 | 23 | Energy consumption varies across applications, devices, and operating |
| 24 | 24 | conditions, making it impractical to choose an absolute energy budget. |
| 25 | 25 | % |
| 26 | -Also, absolute energy constraints may slow down applications to the point | |
| 27 | -that total energy consumption \textit{increases} at the same time that | |
| 26 | +Also, applying absolute energy constraints may slow down applications to the | |
| 27 | +point that total energy consumption \textit{increases} and | |
| 28 | 28 | performance is degraded. |
| 29 | 29 | |
| 30 | 30 | Other metrics that incorporate energy take the form of $Energy * Delay^n$. |
| ... | ... | @@ -56,7 +56,7 @@ energy the application could have consumed ($E_{min}$) on the same device as |
| 56 | 56 | inefficiency: $I = \frac{E}{E_{min}}$. |
| 57 | 57 | % |
| 58 | 58 | An \textit{inefficiency} of $1$ represents an application's most efficient |
| 59 | -execution, while $1.5$ indicate the the application consumed $50\%$ more | |
| 59 | +execution, while $1.5$ indicates that the application consumed $50\%$ more | |
| 60 | 60 | energy than its most efficient execution.
| 61 | 61 | % |
| 62 | 62 | Inefficiency is independent of workloads and devices and avoids the problems |
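A one-line helper makes the metric concrete; the numbers in the trailing comment are illustrative, not measurements from this work:

```python
def inefficiency(energy_j, min_energy_j):
    """Inefficiency I = E / E_min, as defined above.

    I == 1.0: the application ran at its most efficient point.
    I == 1.5: it consumed 50% more energy than its most efficient execution.
    """
    if min_energy_j <= 0:
        raise ValueError("E_min must be positive")
    return energy_j / min_energy_j

# Example (illustrative numbers): inefficiency(3.1, 2.0) == 1.55,
# i.e. 55% more energy than the most efficient execution.
```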
| ... | ... | @@ -86,7 +86,7 @@ We continue by addressing these questions. |
| 86 | 86 | % performance. |
| 87 | 87 | Devices will operate between an inefficiency of 1 and $I_{max}$, which
| 88 | 88 | represents the unbounded energy constraint, allowing the application to
| 89 | -consume unbounded energy to deliver the best performance. | |
| 89 | +consume as much energy as necessary to deliver the best performance. | |
| 90 | 90 | % |
| 91 | 91 | $I_{max}$ depends upon applications and devices. |
| 92 | 92 | % |
| ... | ... | @@ -165,7 +165,7 @@ of instructions. |
| 165 | 165 | %We envision a system capable of scaling voltage and frequency of CPU and only |
| 166 | 166 | %frequency of DRAM. |
| 167 | 167 | Our models consider cross-component interactions on performance and energy. |
| 168 | -Performance model uses hardware performance counters to measure amount of time | |
| 168 | +The performance model uses hardware performance counters to measure the amount of time
| 169 | 169 | each component is $Busy$ completing the work, $Idle$ stalled on the other
| 170 | 170 | component, and $Waiting$ for more work. We designed a systematic methodology to
| 171 | 171 | scale these states to estimate the execution time of a given workload at different
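The text only summarizes the model, so the sketch below fills in plausible first-order scaling rules as assumptions of ours (Busy time scales inversely with the component's own frequency, Idle time tracks the other component's scaled Busy time, Waiting time does not scale); it illustrates the shape of such a model rather than the paper's exact methodology:

```python
def estimate_execution_time(cpu, mem, f_cpu_ratio, f_mem_ratio):
    """Sketch of a cross-component execution-time estimate.

    `cpu` and `mem` are dicts with counter-derived 'busy', 'idle' and
    'waiting' times (seconds) measured at the reference frequencies;
    f_*_ratio = f_measured / f_target. The scaling rules are first-order
    assumptions, not the paper's exact methodology.
    """
    cpu_busy = cpu["busy"] * f_cpu_ratio      # own work slows/speeds with own clock
    mem_busy = mem["busy"] * f_mem_ratio
    # Idle time is spent stalled on the other component, so it is assumed to
    # grow or shrink in proportion to the other component's scaled busy time.
    cpu_time = cpu_busy + cpu["idle"] * (mem_busy / max(mem["busy"], 1e-9)) + cpu["waiting"]
    mem_time = mem_busy + mem["idle"] * (cpu_busy / max(cpu["busy"], 1e-9)) + mem["waiting"]
    return max(cpu_time, mem_time)
```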
| ... | ... | @@ -198,8 +198,8 @@ system~\cite{david2011memory,deng2012multiscale,deng2011memscale,diniz2007limiti |
| 198 | 198 | % |
| 199 | 199 | While most of the existing multi-component energy management approaches work |
| 200 | 200 | under performance constraints, some have potential to be modified to work |
| 201 | -under energy constraints and thus | |
| 202 | -inefficiency~\cite{bitirgen2008coordinated,deng2012coscale,chen2011coordinating,fan2005synergy,felter2005performance,li2007cross,raghavendra2008no}. | |
| 201 | +under energy constraints and thus could operate under an
| 202 | +inefficiency budget~\cite{bitirgen2008coordinated,deng2012coscale,chen2011coordinating,fan2005synergy,felter2005performance,li2007cross,raghavendra2008no}.
| 203 | 203 | % |
| 204 | 204 | We leave building some of these algorithms into a system as future work. |
| 205 | 205 | % | ... | ... |
inefficiency_speedup.tex
| ... | ... | @@ -5,7 +5,7 @@ Scaling individual components---CPU and memory---using DVFS has been studied in |
| 5 | 5 | the past |
| 6 | 6 | %and |
| 7 | 7 | %researchers have used it |
| 8 | -to make power performance trade-offs. To the best of our knowledge, the prior | |
| 8 | +to make power-performance trade-offs. To the best of our knowledge, prior | |
| 9 | 9 | work has not studied the system-level energy-performance trade-offs of combined
| 10 | 10 | CPU and memory DVFS. |
| 11 | 11 | %considering the interaction between CPU and memory |
| ... | ... | @@ -43,8 +43,8 @@ We make three major observations: |
| 43 | 43 | \noindent \textit{Running slower doesn't mean that the system is running
| 44 | 44 | efficiently.} At the lowest frequencies, 100MHz and 200MHz for CPU and |
| 45 | 45 | memory respectively, \textit{gobmk} takes the longest to execute. These settings slow down the application so much |
| 46 | -that its overall energy consumption increases, thereby resulting in 1.55 | |
| 47 | -inefficiency for \textit{gobmk}. Algorithms that choose these frequency settings spend | |
| 46 | +that its overall energy consumption increases, thereby resulting in an | |
| 47 | +inefficiency of 1.55 for \textit{gobmk}. Algorithms that choose these frequency settings spend | |
| 48 | 48 | 55\% more energy without any performance improvement. |
| 49 | 49 | %The converse is also true |
| 50 | 50 | %as noted by our second observation. |
| ... | ... | @@ -70,9 +70,9 @@ a) use no more than given inefficiency budget b) should use only as much |
| 70 | 70 | inefficiency budget as needed c) and deliver the best performance. |
| 71 | 71 | %\end{enumerate} |
| 72 | 72 | |
| 73 | -Consequently, like other constraints used by algorithms such as performance, power and absolute energy, $inefficiency$ | |
| 73 | +Consequently, like other constraints used by algorithms such as performance, power and absolute energy, inefficiency | |
| 74 | 74 | also allows energy management algorithms to waste system energy. We suggest |
| 75 | -that, even though $inefficiency$ doesn't completely eliminate the problem of | |
| 75 | +that, even though inefficiency doesn't completely eliminate the problem of | |
| 76 | 76 | wasting energy, it mitigates the problem. For example, rate-limiting approaches
| 77 | 77 | waste energy because the energy budget is specified for a given time interval
| 78 | 78 | and does not require a specific amount of work to be done within that budget. | ... | ...
optimal_performance.tex
| ... | ... | @@ -58,7 +58,7 @@ possible frequency settings under given inefficiency budget. It then finds the |
| 58 | 58 | CPU and memory frequency settings that result in highest speedup. In cases |
| 59 | 59 | where multiple settings result in similar speedup (within 0.5\%), to filter out |
| 60 | 60 | simulation noise, the algorithm selects the settings with highest CPU (first) |
| 61 | -and memory frequency as this setting is bound to have highest performance among | |
| 61 | +and then memory frequency, as this setting is bound to have the highest performance among | |
| 62 | 62 | the other possibilities. |
| 63 | 63 | |
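The selection rule just described can be summarized in a few lines; the candidate-tuple layout and the encoding of the 0.5\% noise threshold below are illustrative:

```python
def pick_optimal_setting(candidates, budget, noise=0.005):
    """Sketch of the selection rule described above.

    `candidates` holds (cpu_mhz, mem_mhz, speedup, inefficiency) tuples
    measured for one sample (layout illustrative).
    """
    feasible = [c for c in candidates if c[3] <= budget]
    if not feasible:
        return None
    best = max(c[2] for c in feasible)
    # Settings within `noise` of the best speedup are treated as ties
    # (filtering simulation noise); prefer highest CPU, then memory, frequency.
    ties = [c for c in feasible if c[2] >= best * (1 - noise)]
    return max(ties, key=lambda c: (c[0], c[1]))
```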
| 64 | 64 | Figure~\ref{gobmk-optimal} plots the optimal settings for $gobmk$ for all |
| ... | ... | @@ -103,7 +103,7 @@ optimal settings for every sample may hinder some energy-performance trade-off |
| 103 | 103 | that could have been made if performance was not so tightly bounded (to only |
| 104 | 104 | highest performance). For example, \textit{bzip2} is CPU bound and therefore |
| 105 | 105 | its performance at memory frequency of 200MHz is within 3\% of performance at a |
| 106 | -memory frequency of 800MHz while CPU is running at 1000MHz. By sacrificing that | |
| 106 | +memory frequency of 800MHz while the CPU is running at 1000MHz. By sacrificing that | |
| 107 | 107 | 3\% of performance, the system could have consumed 1/4 the memory background |
| 108 | 108 | energy while staying well under the given inefficiency budget.
| 109 | 109 | %\end{enumerate} | ... | ... |
paper.tex
| ... | ... | @@ -81,6 +81,7 @@ Geoffrey Challen, Mark Hempstead} |
| 81 | 81 | \input{algorithm_implications.tex} |
| 82 | 82 | % 20 Apr 2015 : GWA : Add things here as needed. |
| 83 | 83 | \input{conclusions.tex} |
| 84 | +\input{acknowledgement.tex} | |
| 84 | 85 | |
| 85 | 86 | % 23 Sep 2014 : GWA : TODO : Reenable before submission. |
| 86 | 87 | ... | ... |
performance_clusters.tex
| ... | ... | @@ -122,8 +122,8 @@ Not all of the stable regions increase in length with increasing inefficiency bu |
| 122 | 122 | %inefficiency is a |
| 123 | 123 | %function of workload characteristics. |
| 124 | 124 | If consecutive |
| 125 | -samples of a workload have a small difference in performance but differ significantly in energy | |
| 126 | -consumption then only at | |
| 125 | +samples of a workload have a small difference in performance, but differ significantly in energy | |
| 126 | +consumption, then only at | |
| 127 | 127 | higher inefficiency budgets will the system find common settings for these |
| 128 | 128 | consecutive samples. % because all settings under an inefficiency budget are considered. |
| 129 | 129 | %Note that we find the performance clusters by considering |
| ... | ... | @@ -144,7 +144,7 @@ Figure~\ref{clusters-milc} shows that \textit{milc} has similar trends as |
| 144 | 144 | |
| 145 | 145 | An interesting observation from the performance clusters is that algorithms |
| 146 | 146 | like CoScale~\cite{deng2012coscale} that search for the best performing settings every interval starting |
| 147 | -from the maximum frequency settings are not optimal. Algorithms can reduce the | |
| 147 | +from the maximum frequency settings are not efficient. Algorithms can reduce the | |
| 148 | 148 | overhead of the optimal settings search by starting the search from the settings selected
| 149 | 149 | for the previous interval as application phases are often stable for multiple sample intervals. |
| 150 | 150 | %as the application phases don't change drastically in |
| ... | ... | @@ -168,7 +168,7 @@ settings between the current sample performance cluster and the available |
| 168 | 168 | settings until the previous sample. When the algorithm finds no more common |
| 169 | 169 | samples, it marks the end of the stable region. If more than one frequency pair |
| 170 | 170 | exists in the available settings for this region, the algorithm chooses the |
| 171 | -setting with highest CPU (first) and memory frequency as optimal settings for this | |
| 171 | +setting with the highest CPU (first) and then memory frequency as the optimal settings for this | |
| 172 | 172 | region. Figure~\ref{lbm-stable-line-5-annotated} shows the CPU and memory frequency |
| 173 | 173 | settings selected for stable regions of benchmark \textit{lbm}. It also has |
| 174 | 174 | markers indicating the end of each stable region. In this figure, note that for |
| ... | ... | @@ -176,11 +176,11 @@ every stable region (between any two markers) the frequency of both CPU and memo |
| 176 | 176 | constant. |
| 177 | 177 | |
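The stable-region construction can be sketched as a simple set-intersection pass over the per-sample performance clusters; the data layout below is illustrative and, like the algorithm itself, assumes all samples and their clusters are known up front:

```python
def stable_regions(sample_clusters):
    """Sketch of the offline stable-region construction described above.

    `sample_clusters` is a list where entry i is the (non-empty) set of
    (cpu_mhz, mem_mhz) settings in sample i's performance cluster under the
    inefficiency budget. Returns (first_sample, last_sample, setting) triples.
    """
    regions, available, start = [], None, 0
    for i, cluster in enumerate(sample_clusters):
        if available is None:
            available, start = set(cluster), i
            continue
        common = available & set(cluster)
        if common:
            available = common                 # region continues, settings shrink
        else:
            # No common setting left: close the region. max() over (cpu, mem)
            # tuples picks the highest CPU frequency first, then memory.
            regions.append((start, i - 1, max(available)))
            available, start = set(cluster), i
    if available is not None:
        regions.append((start, len(sample_clusters) - 1, max(available)))
    return regions
```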
| 178 | 178 | %Note that |
| 179 | -Our algorithm is not practical for real systems, it knows the characteristics of the | |
| 179 | +Our algorithm is not practical for real systems, as it knows the characteristics of the | |
| 180 | 180 | future samples and their performance clusters in the beginning of a stable |
| 181 | 181 | region. % (and therefore is impractical to implement in real systems). |
| 182 | 182 | We are |
| 183 | -currently designing algorithms that are capable of tuning the system while | |
| 183 | +currently designing hardware and software algorithms capable of tuning the system while | |
| 184 | 184 | running the application; we leave this as future work. In Section~\ref{sec-algo-implications}, we
| 185 | 185 | propose ways in which the length of stable regions and the available settings for a
| 186 | 186 | given region can be predicted for energy management algorithms in real systems. |
| ... | ... | @@ -260,7 +260,7 @@ across benchmarks for multiple cluster thresholds at inefficiency budget of 1.3. |
| 260 | 260 | |
| 261 | 261 | \subsection{Energy-Performance Trade-offs} |
| 262 | 262 | In this subsection we analyze the energy-performance trade-offs made by our |
| 263 | -ideal algorithm. We then add tuning cost of our algorithm and compare the | |
| 263 | +ideal algorithm. We then add the tuning cost of our algorithm and compare the | |
| 264 | 264 | energy-performance trade-offs across multiple applications. We study multiple
| 265 | 265 | cluster thresholds and an inefficiency budget of 1.3. |
| 266 | 266 | |
| ... | ... | @@ -300,9 +300,7 @@ frequency transitions. We assume tuning overhead |
| 300 | 300 | of 500us and 30uJ, which includes computing inefficiencies, searching for the |
| 301 | 301 | optimal setting, and transitioning the hardware to new
| 302 | 302 | settings~\cite{deng2012coscale}. We assumed that a space of 100 settings is |
| 303 | -searched for every transition. \textit{gobmk} is the only benchmark that shows a | |
| 304 | -performance improvement from the optimal settings when performance is allowed to | |
| 305 | -degrade, which is unexpected. We are investigating its root cause. | |
| 303 | +searched for every transition. | |
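A quick back-of-the-envelope calculation shows why the transition count matters; the per-transition cost comes from the text above, while the transition counts and runtime in the comment are illustrative:

```python
TRANSITION_TIME_S   = 500e-6   # 500 us per transition (from the text)
TRANSITION_ENERGY_J = 30e-6    # 30 uJ per transition (from the text)

def tuning_overhead(num_transitions, runtime_s):
    """Return (fraction of runtime, extra energy in joules) spent on tuning."""
    return (num_transitions * TRANSITION_TIME_S / runtime_s,
            num_transitions * TRANSITION_ENERGY_J)

# Illustrative: 1000 transitions over a 10 s run cost 0.5 s (5%) and 30 mJ,
# while 100 transitions cost only 0.05 s (0.5%) and 3 mJ -- longer stable
# regions mean fewer transitions and hence lower overhead.
```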
| 306 | 304 | %This is not intuitive and we are investigating the cause of this anomaly |
| 307 | 305 | %\XXXnote{MH: be careful I would cut this s%entance at a minimum and then find |
| 308 | 306 | %the reason for the change}. |
| ... | ... | @@ -317,8 +315,8 @@ samples. This results in longer stable regions. |
| 317 | 315 | stable region. The longer the stable regions, the lower |
| 318 | 316 | the number of transitions that the system needs to make.
| 319 | 317 | \item Allowing a higher degradation in performance may, in fact, result in improved |
| 320 | -performance when tuning overhead of algorithms is included due to reduction in | |
| 321 | -number of frequency transitions in the system. Consequently energy savings also | |
| 318 | +performance when tuning overhead is included, owing to the reduction in the | |
| 319 | +number of frequency transitions in the system; consequently, energy savings also | |
| 322 | 320 | increase. |
| 323 | 321 | \end{enumerate} |
| 324 | 322 | ... | ... |
system_methodology.tex
| ... | ... | @@ -12,7 +12,8 @@ Recent |
| 12 | 12 | research~\cite{david2011memory,deng2011memscale} has shown that DRAM frequency scaling |
| 13 | 13 | also provides performance and energy trade-offs. |
| 14 | 14 | |
| 15 | -In this work, we scale frequency and voltage for the CPU and scale only frequency for memory. | |
| 15 | +In this work, we scale frequency and voltage for the CPU and scale only | |
| 16 | +frequency for the memory~\cite{david2011memory,deng2011memscale}. | |
| 16 | 17 | %In this work, we scale frequency and voltage for the CPU and for the memory, scale frequency only. |
| 17 | 18 | %to make energy-performance trade-offs. |
| 18 | 19 | %Dynamic Voltage and |
| ... | ... | @@ -42,7 +43,7 @@ to perform our studies. |
| 42 | 43 | Current Gem5 versions provide the infrastructure necessary to change CPU |
| 43 | 44 | frequency and voltage; we extended Gem5 DVFS to incorporate memory frequency |
| 44 | 45 | scaling. As shown in Figure~\ref{fig-system-block-diag}, Gem5 provides a DVFS |
| 45 | -controller device that provides interface to control frequency by the OS at | |
| 46 | +controller device that provides an interface for the OS to control frequency at | |
| 46 | 47 | runtime. We developed a memory frequency governor similar to existing Linux CPU |
| 47 | 48 | frequency governors. |
| 48 | 49 | %that are capable of tuning memory frequency at runtime. |
| ... | ... | @@ -76,14 +77,14 @@ We developed energy models for the CPU and DRAM for our studies. Gem5 comes |
| 76 | 77 | with the energy models for various DRAM chipsets. The |
| 77 | 78 | DRAMPower~\cite{drampower-tool} model is integrated into Gem5 and computes the |
| 78 | 79 | memory energy consumption periodically during the benchmark execution. However, |
| 79 | -Gem5 lacks a model for CPU energy consumption. We developed a processor power | |
| 80 | +Gem5 lacks a model for CPU energy consumption. We developed a processor power | |
| 80 | 81 | model based on empirical measurements of a PandaBoard~\cite{pandaboard-url} |
| 81 | 82 | evaluation board. The board includes an OMAP4430~chipset with a Cortex~A9
| 82 | 83 | processor; this chipset is used in the mobile platform we want to emulate, the |
| 83 | 84 | Samsung Nexus S. We ran microbenchmarks designed to stress the PandaBoard to |
| 84 | 85 | its full utilization and measured power consumed using an Agilent~34411A |
| 85 | 86 | multimeter. Because of the limitations of the platform, we could only measure |
| 86 | -peak dynamic power. Therefore to model different voltage levels we scaled it | |
| 87 | +peak dynamic power. Therefore, to model different voltage levels, we scaled it | |
| 88 | 89 | quadratically with voltage and linearly with frequency $(P{\propto}V^{2}f)$. Our
| 88 | 89 | peak dynamic power agrees with the numbers reported by previous |
| 89 | 90 | work~\cite{poweragile-hotos11} and the datasheets. |
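The scaling rule in the previous paragraph is simple enough to state directly in code; the example numbers in the comment are illustrative, not the PandaBoard measurements:

```python
def scale_dynamic_power(p_peak_w, v_peak, f_peak_hz, v, f_hz):
    """Scale measured peak dynamic power to another (V, f) operating point
    using P proportional to V^2 * f, as described above."""
    return p_peak_w * (v / v_peak) ** 2 * (f_hz / f_peak_hz)

# Illustrative: scale_dynamic_power(1.0, 1.25, 1000e6, 1.0, 500e6) == 0.32,
# i.e. halving the frequency and dropping the voltage from 1.25V to 1.0V
# reduces dynamic power to about a third of its peak value.
```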
| ... | ... | @@ -105,22 +106,22 @@ proportional to supply voltage~\cite{leakage-islped02}. |
| 105 | 106 | %with our CPU power model to compute CPU energy consumption of the application at run time. |
| 106 | 107 | |
| 107 | 108 | \subsection{Experimental Methodology} |
| 108 | -Our simulation infrastructure is based on Android~4.1.1 ``Jelly Bean'' run | |
| 109 | -on the Gem5 full system simulator. We model a Cortex-A9 processor, single core, | |
| 110 | -out-of-order CPU with an issue width of 8, L1 cache size of 64~KB with access | |
| 111 | -latency of 2 core cycles and a unified L2 cache of size 2~MB with hit latency of | |
| 112 | -12 core cycles. The CPU and caches operate under the same clock domain. For our | |
| 113 | -purposes, we have configured the CPU clock domain frequency to have a range of | |
| 114 | -100--1000~MHZ with highest voltage being 1.25V. | |
| 109 | +Our simulation infrastructure is based on Android~4.1.1 ``Jelly Bean'' run on | |
| 110 | +the Gem5 full system simulator. We use the default core configuration provided | |
| 111 | +by Gem5 in revision 10585, which is designed to reflect an ARM Cortex-A15 | |
| 112 | +processor, with a 64~KB L1 cache with an access latency of 2 core cycles and a | |
| 113 | +unified 2~MB L2 cache with a hit latency of 12 core cycles. The CPU and caches | |
| 114 | +operate under the same clock domain. For our purposes, we have configured the | |
| 115 | +CPU clock domain frequency to have a range of 100--1000~MHz, with the highest | |
| 116 | +voltage being 1.25V. | |
| 115 | 117 | % MH: This might confuse readers |
| 116 | -%Our | |
| 117 | -%experiments with a simple ring oscillator show that voltage changes by | |
| 118 | -%0.02V/30MHz. The voltage and frequency pairs match with the frequency steps used | |
| 119 | -%by the Nexus S. | |
| 118 | +%Our experiments with a simple ring oscillator show that voltage changes by | |
| 119 | +%0.02V/30MHz. The voltage and frequency pairs match with the frequency steps | |
| 120 | +%used by the Nexus S. | |
| 120 | 121 | |
| 121 | 122 | For the memory system, we simulated a single-channel, single-rank LPDDR3 memory using an open-page
| 122 | 123 | policy. Timing and current parameters for LPDDR3 are configured as specified in |
| 123 | -Micron data sheet~\cite{micronspec-url}. Memory clock domain is configured with a | |
| 124 | +data sheets from Micron~\cite{micronspec-url}. The memory clock domain is configured with a | |
| 124 | 125 | frequency range of 200MHz to 800MHz. As mentioned earlier, we did not scale memory |
| 125 | 126 | voltage. The power supplies---VDD and VDD2---for LPDDR3 are fixed at 1.8V and 1.2V respectively. |
| 126 | 127 | |
| ... | ... | @@ -137,10 +138,10 @@ benchmarks that have interesting and unique phases. |
| 137 | 138 | %selected benchmarks that have interesting and unique phases with finer |
| 138 | 139 | %frequency step granularity of 30MHz for CPU and 40MHz for memory, a total of |
| 139 | 140 | %496 settings. |
| 140 | -Due to limited resources and time, running simulations for all benchmarks with | |
| 141 | -finer frequency steps was difficult as it would have resulted in more than | |
| 142 | -10,000 simulations, where each simulation would take anywhere between 4 to 12 | |
| 143 | -hours. | |
| 141 | +%Due to limited resources and time, running simulations for all benchmarks with | |
| 142 | +%finer frequency steps was difficult as it would have resulted in more than | |
| 143 | +%10,000 simulations, where each simulation would take anywhere between 4 to 12 | |
| 144 | +%hours. | |
| 144 | 145 | |
| 145 | 146 | We collected samples of a fixed amount of work so that each sample would |
| 146 | 147 | represent the same work even across different frequencies. In Gem5, we collected |
| ... | ... | @@ -160,10 +161,10 @@ performance or energy. |
| 160 | 161 | Although individual energy-performance trade-offs of DVFS for CPU and |
| 161 | 162 | DFS for memory have been studied in the past, the trade-off resulting from |
| 162 | 163 | the cross-component interaction of these two components has not been |
| 163 | -characterized. CoScale~\cite{deng2012coscale} did point out that | |
| 164 | +characterized. CoScale~\cite{deng2012coscale} did point out that the | |
| 164 | 165 | interplay of performance and energy consumption of these two |
| 165 | 166 | components is complex and did present a heuristic that attempts to |
| 166 | -pick the optimal point. However, it did not measure and characterize | |
| 166 | +pick the optimal point. In the next section, we measure and characterize | |
| 167 | 168 | the larger space of all system level performance and energy trade-offs |
| 168 | 169 | of various CPU and memory frequency settings. |
| 169 | 170 | %In the next section, we study how performance and | ... | ... |