Commit 5ae0e5d5998b8b27833e594d0a83615a3a99552f
1 parent 0e5dc2a4
more edits
Showing 7 changed files with 41 additions and 41 deletions
abstract.tex
| ... | ... | @@ -2,13 +2,13 @@ |
| 2 | 2 | |
| 3 | 3 | Battery lifetime continues to be a top complaint about smartphones. Dynamic |
| 4 | 4 | voltage and frequency scaling (DVFS) has existed for mobile device CPUs for some |
| 5 | -time, and provides a tradeoff between energy and performance. DVFS is beginning | |
| 6 | -to be applied to memory as well to make more energy-performance tradeoffs | |
| 7 | -possible. | |
| 5 | +time, and provides a tradeoff between energy and performance. Dynamic frequency | |
| 6 | +scaling is beginning to be applied to memory as well to make more | |
| 7 | +energy-performance tradeoffs possible. | |
| 8 | 8 | |
| 9 | -We present the first characterization of the behavior and optimal frequency | |
| 10 | -settings of workloads running both under \textit{energy constraints} and on | |
| 11 | -systems with \textit{both} CPU and memory DVFS, an environment representative | |
| 9 | +We present the first characterization of the behavior of the optimal frequency | |
| 10 | +settings of workloads running both, under \textit{energy constraints} and on | |
| 11 | +systems capable of CPU DVFS and memory DFS, an environment representative | |
| 12 | 12 | of next-generation mobile devices. Our results show that continuously using |
| 13 | 13 | the optimal frequency settings results in a large number of frequency |
| 14 | 14 | transitions which end up hurting performance. However, by permitting a small | ... | ... |
inefficiency.tex
| ... | ... | @@ -7,7 +7,7 @@ management algorithms for mobile systems should optimize performance under |
| 7 | 7 | \textit{energy constraints}. |
| 8 | 8 | % |
| 9 | 9 | While several researchers have proposed algorithms that work under energy |
| 10 | -constraints, these approaches require that the constraints are expressed in | |
| 10 | +constraints, these approaches require that the constraints be expressed in | |
| 11 | 11 | terms of absolute energy~\cite{mobiheld09-cinder,ecosystem}. |
| 12 | 12 | % |
| 13 | 13 | For example, rate-limiting approaches take the maximum energy that can be |
| ... | ... | @@ -24,7 +24,7 @@ Energy consumption varies across applications, devices, and operating |
| 24 | 24 | conditions, making it impractical to choose an absolute energy budget. |
| 25 | 25 | % |
| 26 | 26 | Also, applying absolute energy constraints may slow down applications to the |
| 27 | -point that total energy consumption \textit{increases} and | |
| 27 | +point where total energy consumption \textit{increases} and | |
| 28 | 28 | performance is degraded. |
| 29 | 29 | |
| 30 | 30 | Other metrics that incorporate energy take the form of $Energy * Delay^n$. |
| ... | ... | @@ -34,7 +34,7 @@ We argue that while the energy-delay product can be used as a |
| 34 | 34 | \textit{constraint} to specify how much energy can be used to improve |
| 35 | 35 | performance. |
| 36 | 36 | % |
| 37 | -A effective constraint should be (1) relative to the applications inherent | |
| 37 | +An effective constraint should be (1) relative to the applications inherent | |
| 38 | 38 | energy needs and (2) independent of applications and devices. |
| 39 | 39 | % |
| 40 | 40 | Because it uses absolute energy, the energy-delay product meets neither of |
| ... | ... | @@ -57,7 +57,7 @@ inefficiency: $I = \frac{E}{E_{min}}$. |
| 57 | 57 | % |
| 58 | 58 | An \textit{inefficiency} of $1$ represents an application's most efficient |
| 59 | 59 | execution, while $1.5$ indicates that the application consumed $50\%$ more |
| 60 | -energy that its most efficient execution. | |
| 60 | +energy than its most efficient execution. | |
| 61 | 61 | % |
| 62 | 62 | Inefficiency is independent of workloads and devices and avoids the problems |
| 63 | 63 | inherent to absolute energy constraints. |
| ... | ... | @@ -143,7 +143,7 @@ We propose two methods for computing $E_{min}$: |
| 143 | 143 | |
| 144 | 144 | \item \textbf{Predicting and learning:} The overhead of the $E_{min}$ computation |
| 145 | 145 | can be further reduced by predicting $E_{min}$ based on previous observations |
| 146 | - and learning continuously. | |
| 146 | + and by continuous learning. | |
| 147 | 147 | % |
| 148 | 148 | A variety of learning based approaches~\cite{li2009machine} have |
| 149 | 149 | been proposed in the past to estimate various metrics and application phases | ... | ... |
inefficiency_speedup.tex
| ... | ... | @@ -7,7 +7,7 @@ the past |
| 7 | 7 | %researchers have used it |
| 8 | 8 | to make power performance trade-offs. To the best of our knowledge, prior |
| 9 | 9 | work has not studied the system level energy-performance trade-offs of combined |
| 10 | -CPU and memory DVFS. | |
| 10 | +CPU and memory frequency scaling. | |
| 11 | 11 | %considering the interaction between CPU and memory |
| 12 | 12 | %frequency scaling. |
| 13 | 13 | We take a first step and explore these trade-offs and show that incorrect |
| ... | ... | @@ -71,8 +71,8 @@ inefficiency budget as needed c) and deliver the best performance. |
| 71 | 71 | %\end{enumerate} |
| 72 | 72 | |
| 73 | 73 | Consequently, like other constraints used by algorithms such as performance, power and absolute energy, inefficiency |
| 74 | -also allows energy management algorithms to waste system energy. We suggest | |
| 75 | -that, even though inefficiency doesn't completely eliminate the problem of | |
| 74 | +also allows energy management algorithms to waste system energy. We argue | |
| 75 | +that, although inefficiency doesn't completely eliminate the problem of | |
| 76 | 76 | wasting energy, it mitigates the problem. For example, rate limiting approaches |
| 77 | 77 | waste energy as energy budget is specified for a given amount of time interval |
| 78 | 78 | and doesn't require a specific amount of work to be done within that budget. | ... | ... |
introduction.tex
| ... | ... | @@ -30,14 +30,14 @@ To better understand these systems, we characterize how the most performant |
| 30 | 30 | CPU and memory frequency settings change for multiple workloads under various |
| 31 | 31 | energy constraints. |
| 32 | 32 | |
| 33 | -Our work represents two advances over previous efforts. | |
| 33 | +Our work presents two advances over previous efforts. | |
| 34 | 34 | % |
| 35 | 35 | First, while previous works have explored energy minimizations using DVFS |
| 36 | 36 | under performance constraints focusing on reducing slack, we are the first to |
| 37 | 37 | study the potential DVFS settings under an energy constraint. |
| 38 | 38 | % |
| 39 | 39 | Specifying performance constraints for servers is appropriate, since they are |
| 40 | -both wall-powered and have terms of service that must be met. | |
| 40 | +both wall-powered and have quality of service constraints that must be met. | |
| 41 | 41 | % |
| 42 | 42 | Therefore, they do not have to and cannot afford to sacrifice too much |
| 43 | 43 | performance. |
| ... | ... | @@ -53,7 +53,7 @@ energy constraints and it is both application and device independent---unlike |
| 53 | 53 | existing metrics. |
| 54 | 54 | |
| 55 | 55 | Second, we are the first to characterize optimal frequency settings for |
| 56 | -systems providing both CPU and memory DVFS. | |
| 56 | +systems providing CPU DVFS and memory DFS. | |
| 57 | 57 | % |
| 58 | 58 | We find that closely tracking the optimal settings during execution produces |
| 59 | 59 | many transitions and large frequency transition overhead. |
| ... | ... | @@ -65,7 +65,7 @@ We characterize the relationship between the amount of performance loss and |
| 65 | 65 | the rate of tuning for several benchmarks, and introduce the concepts of |
| 66 | 66 | \textit{performance clusters} and \textit{stable regions} to aid the process. |
| 67 | 67 | |
| 68 | -We make following four contributions: | |
| 68 | +We make the following contributions: | |
| 69 | 69 | % |
| 70 | 70 | \begin{enumerate} |
| 71 | 71 | % |
| ... | ... | @@ -74,7 +74,7 @@ system to express the amount of extra energy that can be used to improve |
| 74 | 74 | performance. |
| 75 | 75 | % |
| 76 | 76 | \item We study the energy-performance trade-offs of systems that are capable |
| 77 | -of both CPU and memory DVFS for multiple applications. We show that poor | |
| 77 | +of CPU DVFS and memory DFS for multiple applications. We show that poor | |
| 78 | 78 | frequency selection can hurt both performance and energy consumption. |
| 79 | 79 | % |
| 80 | 80 | \item We characterize the optimal frequency settings for multiple |
| ... | ... | @@ -87,7 +87,7 @@ management algorithms. |
| 87 | 87 | % |
| 88 | 88 | \end{enumerate} |
| 89 | 89 | |
| 90 | -We use the \texttt{gem5} simulator, the Android smartphone platform and Linux | |
| 90 | +We use the \texttt{Gem5} simulator, the Android smartphone platform and Linux | |
| 91 | 91 | kernel, and an empirical power model to (1) measure the inefficiency of |
| 92 | 92 | several applications for a wide range of frequency settings, (2) compute |
| 93 | 93 | performance clusters, and (3) study how performance clusters evolve. |
| ... | ... | @@ -112,4 +112,4 @@ studies their characteristics. |
| 112 | 112 | % |
| 113 | 113 | Section~\ref{sec-algo-implications} presents implications of |
| 114 | 114 | using performance clusters on energy-management algorithms, and |
| 115 | -Section~\ref{sec-conclusions} concludes. | |
| 115 | +Section~\ref{sec-conclusions} summarizes and concludes the paper. | ... | ... |
optimal_performance.tex
| ... | ... | @@ -5,7 +5,7 @@ |
| 5 | 5 | \centering |
| 6 | 6 | \includegraphics[width=\columnwidth]{figures/plots/496/2d_best_point_variation_mulineff/gobmk_2d_stable_point_mulineff_cpi_mpki.pdf} |
| 7 | 7 | \vspace{-0.5em} |
| 8 | -\caption{\textbf{Optimal Performance Point for \text{Gobmk} Across Inefficiencies:} At | |
| 8 | +\caption{\textbf{Optimal Performance Point for \textit{gobmk} Across Inefficiencies:} At | |
| 9 | 9 | low inefficiency budgets, the optimal frequency settings follow CPI of the |
| 10 | 10 | application, and select high memory frequencies for memory intensive phases. % with |
| 11 | 11 | %high CPI. |
| ... | ... | @@ -36,7 +36,7 @@ inefficiency budget is a function of workload.} |
| 36 | 36 | \end{subfigure}% |
| 37 | 37 | \vspace{0.5em} |
| 38 | 38 | \caption{\textbf{Performance Clusters of \textit{milc.}} |
| 39 | -\textit{Milc} is CPU intensive to a large extent with some memory intensive | |
| 39 | +\textit{milc} is CPU intensive to a large extent with some memory intensive | |
| 40 | 40 | phases. At higher thresholds, while CPU frequency is tightly bound, performance |
| 41 | 41 | clusters cover a wide range of memory settings due to small performance |
| 42 | 42 | difference across these frequencies. } |
| ... | ... | @@ -61,7 +61,7 @@ simulation noise, the algorithm selects the settings with highest CPU (first) |
| 61 | 61 | and then memory frequency as this setting is bound to have highest performance among |
| 62 | 62 | the other possibilities. |
| 63 | 63 | |
| 64 | -Figure~\ref{gobmk-optimal} plots the optimal settings for $gobmk$ for all | |
| 64 | +Figure~\ref{gobmk-optimal} plots the optimal settings for \textit{gobmk} for all | |
| 65 | 65 | benchmark samples (each of length 10~M instructions) across multiple |
| 66 | 66 | inefficiency constraints. At low inefficiencies, the optimal settings follow |
| 67 | 67 | the trends in CPI (cycles per instruction) and MPKI (misses per thousand | ... | ... |
performance_clusters.tex
| ... | ... | @@ -4,7 +4,7 @@ |
| 4 | 4 | \centering |
| 5 | 5 | \includegraphics[width=\columnwidth]{./figures/plots/496/stable_line_plots/lbm_stable_lineplot_annotated_5.pdf} |
| 6 | 6 | \vspace{-0.5em} |
| 7 | -\caption{\textbf{Stable Regions and Transitions for \textit{Lbm} with | |
| 7 | +\caption{\textbf{Stable Regions and Transitions for \textit{lbm} with | |
| 8 | 8 | Threshold of 5\% and Inefficiency Budget of 1.3:} Solid lines represent the |
| 9 | 9 | stable regions and vertical dashed lines mark the transitions made by |
| 10 | 10 | \textit{lbm}.} | ... | ... |
system_methodology.tex
| ... | ... | @@ -25,7 +25,7 @@ performance for energy savings. |
| 25 | 25 | %voltage could result in data corruption since the memory array itself is |
| 26 | 26 | %asynchronous. |
| 27 | 27 | As no current hardware systems support memory frequency scaling, |
| 28 | -we resort to Gem5~\cite{Binkert:gem5}, a cycle-accurate full system simulator | |
| 28 | +we resort to \texttt{Gem5}~\cite{Binkert:gem5}, a cycle-accurate full system simulator | |
| 29 | 29 | %as a platform |
| 30 | 30 | to perform our studies. |
| 31 | 31 | |
| ... | ... | @@ -34,21 +34,21 @@ to perform our studies. |
| 34 | 34 | \centering |
| 35 | 35 | \includegraphics[width=0.75\columnwidth]{./figures/plots/systemBlockDiagram.pdf} |
| 36 | 36 | \caption{\textbf{System Block Diagram}: Blocks that are newly added or |
| 37 | - significantly modified from Gem5 origin implementation are shaded.} | |
| 37 | + significantly modified from \texttt{Gem5} origin implementation are shaded.} | |
| 38 | 38 | \label{fig-system-block-diag} |
| 39 | 39 | \end{figure} |
| 40 | 40 | |
| 41 | 41 | %We envision a system that consists of a CPU capable of tuning its voltage and |
| 42 | 42 | %frequency and memory that supports frequency scaling. |
| 43 | -Current Gem5 versions provide the infrastructure necessary to change CPU | |
| 44 | -frequency and voltage; we extended Gem5 DVFS to incorporate memory frequency | |
| 45 | -scaling. As shown in Figure~\ref{fig-system-block-diag}, Gem5 provides a DVFS | |
| 43 | +Current \texttt{Gem5} versions provide the infrastructure necessary to change CPU | |
| 44 | +frequency and voltage; we extended \texttt{Gem5} DVFS to incorporate memory frequency | |
| 45 | +scaling. As shown in Figure~\ref{fig-system-block-diag}, \texttt{Gem5} provides a DVFS | |
| 46 | 46 | controller device that provides an interface to control frequency by the OS at |
| 47 | 47 | runtime. We developed a memory frequency governor similar to existing Linux CPU |
| 48 | 48 | frequency governors. Timing and current parameters of DRAM are scaled with its |
| 49 | 49 | frequency as described in the technical note from Micron~\cite{micronpower-TN-url}. |
| 50 | 50 | %that are capable of tuning memory frequency at runtime. |
| 51 | -The blocks that we added or significantly modified from Gem5's original | |
| 51 | +The blocks that we added or significantly modified from \texttt{Gem5}'s original | |
| 52 | 52 | implementation are shaded in Figure~\ref{fig-system-block-diag}. |
| 53 | 53 | |
| 54 | 54 | \begin{figure*}[t] |
| ... | ... | @@ -75,15 +75,15 @@ and degrade performance simultaneously.} |
| 75 | 75 | |
| 76 | 76 | \subsection{Energy Models} |
| 77 | 77 | \label{subsec-energy-models} |
| 78 | -We developed energy models for the CPU and DRAM for our studies. Gem5 comes | |
| 78 | +We developed energy models for the CPU and DRAM for our studies. \texttt{Gem5} comes | |
| 79 | 79 | with the energy models for various DRAM chipsets. The |
| 80 | -DRAMPower~\cite{drampower-tool} model is integrated into Gem5 and computes the | |
| 80 | +DRAMPower~\cite{drampower-tool} model is integrated into \texttt{Gem5} and computes the | |
| 81 | 81 | memory energy consumption periodically during the benchmark execution. However, |
| 82 | -Gem5 lacks a model for CPU energy consumption. We developed a processor power | |
| 82 | +\texttt{Gem5} lacks a model for CPU energy consumption. We developed a processor power | |
| 83 | 83 | model based on empirical measurements of a PandaBoard~\cite{pandaboard-url} |
| 84 | 84 | evaluation board. The board includes a OMAP4430~chipset with a Cortex~A9 |
| 85 | 85 | processor; this chipset is used in the mobile platform we want to emulate, the |
| 86 | -Samsung Nexus S. We ran microbenchmarks designed to stress the PandaBoard to | |
| 86 | +Galaxy Nexus S. We ran microbenchmarks designed to stress the PandaBoard to | |
| 87 | 87 | its full utilization and measured power consumed using an Agilent~34411A |
| 88 | 88 | multimeter. Because of the limitations of the platform, we could only measure |
| 89 | 89 | peak dynamic power. Therefore, to model different voltage levels we scaled it |
| ... | ... | @@ -97,7 +97,7 @@ processor is not computing, but unlike leakage power, background power scales |
| 97 | 97 | with clock frequency. We measure background power by calculating the |
| 98 | 98 | difference between the CPU power consumption in its power on idle state and |
| 99 | 99 | deep sleep mode (not clocked). Because background power is clocked, it is |
| 100 | -scaled in a similar manner to dynamic power. Leakage power comprises up to | |
| 100 | +scaled in a similar manner to dynamic power. Leakage power comprises up to | |
| 101 | 101 | 30\% of microprocessor peak power consumption~\cite{power7} and is linearly |
| 102 | 102 | proportional to supply voltage~\cite{leakage-islped02}. |
| 103 | 103 | |
| ... | ... | @@ -109,8 +109,8 @@ proportional to supply voltage~\cite{leakage-islped02}. |
| 109 | 109 | |
| 110 | 110 | \subsection{Experimental Methodology} |
| 111 | 111 | Our simulation infrastructure is based on Android~4.1.1 ``Jelly Bean'' run on |
| 112 | -the Gem5 full system simulator. We use default core configuration provided by | |
| 113 | -Gem5 in revision 10585, that is designed to reflect ARM Cortex-A15 processor | |
| 112 | +the \texttt{Gem5} full system simulator. We use default core configuration provided by | |
| 113 | +\texttt{Gem5} in revision 10585, that is designed to reflect ARM Cortex-A15 processor | |
| 114 | 114 | with L1 cache size of 64~KB with access latency of 2 core cycles and a unified |
| 115 | 115 | L2 cache of size 2~MB with hit latency of 12 core cycles. The CPU and caches |
| 116 | 116 | operate under the same clock domain. For our purposes, we have configured the |
| ... | ... | @@ -147,12 +147,12 @@ benchmarks that have interesting and unique phases. |
| 147 | 147 | %hours. |
| 148 | 148 | |
| 149 | 149 | We collected samples of a fixed amount of work so that each sample would |
| 150 | -represent the same work even across different frequencies. In Gem5, we collected | |
| 150 | +represent the same work even across different frequencies. In \texttt{Gem5}, we collected | |
| 151 | 151 | performance and energy consumption data every 10~million user mode |
| 152 | 152 | instructions. |
| 153 | 153 | %this fixed sample of work makes . |
| 154 | 154 | %By collecting data for a fixed amount of work (instructions) we are able to study frequency scaling for workloads; the alternative sampling in time . |
| 155 | -Gem5 provides a mechanism to distinguish between user mode and | |
| 155 | +\texttt{Gem5} provides a mechanism to distinguish between user mode and | |
| 156 | 156 | kernel mode instructions. We used this feature to remove periodic OS traffic and enable a fair comparison |
| 157 | 157 | across simulations of different CPU and memory frequencies. We used the collected |
| 158 | 158 | performance and energy data to study the impact of workload dynamics on the |
| ... | ... | @@ -162,7 +162,7 @@ a given inefficiency budget. Note that, all our studies are performed using |
| 162 | 162 | performance or energy. The interplay of performance and energy consumption of |
| 163 | 163 | CPU and memory frequency scaling is complex as pointed by |
| 164 | 164 | CoScale~\cite{deng2012coscale}. In the next Section, we measure and characterize |
| 165 | -the larger space of all system level performance and energy trade-offs | |
| 165 | +the larger space of system level performance and energy trade-offs | |
| 166 | 166 | of various CPU and memory frequency settings. |
| 167 | 167 | |
| 168 | 168 | %Although individual energy-performance trade-offs of DVFS for CPU and | ... | ... |