diff --git a/abstract.tex b/abstract.tex
index a5d584c..045f5f0 100644
--- a/abstract.tex
+++ b/abstract.tex
@@ -2,13 +2,13 @@ Battery lifetime continues to be a top complaint about smartphones.
Dynamic voltage and frequency scaling (DVFS) has existed for mobile device CPUs for some
-time, and provides a tradeoff between energy and performance. DVFS is beginning
-to be applied to memory as well to make more energy-performance tradeoffs
-possible.
+time, and provides a tradeoff between energy and performance. Dynamic frequency
+scaling is beginning to be applied to memory as well to make more
+energy-performance tradeoffs possible.
-We present the first characterization of the behavior and optimal frequency
-settings of workloads running both under \textit{energy constraints} and on
-systems with \textit{both} CPU and memory DVFS, an environment representative
+We present the first characterization of the behavior of the optimal frequency
+settings of workloads running both under \textit{energy constraints} and on
+systems capable of CPU DVFS and memory DFS, an environment representative
of next-generation mobile devices. Our results show that continuously using
the optimal frequency settings results in a large number of frequency
transitions which end up hurting performance. However, by permitting a small
diff --git a/inefficiency.tex b/inefficiency.tex
index c3f8dfc..f4f82cd 100644
--- a/inefficiency.tex
+++ b/inefficiency.tex
@@ -7,7 +7,7 @@ management algorithms for mobile systems should optimize performance under
\textit{energy constraints}.
%
While several researchers have proposed algorithms that work under energy
-constraints, these approaches require that the constraints are expressed in
+constraints, these approaches require that the constraints be expressed in
terms of absolute energy~\cite{mobiheld09-cinder,ecosystem}.
% For example, rate-limiting approaches take the maximum energy that can be
@@ -24,7 +24,7 @@ Energy consumption varies across applications, devices, and operating
conditions, making it impractical to choose an absolute energy budget.
%
Also, applying absolute energy constraints may slow down applications to the
-point that total energy consumption \textit{increases} and
+point where total energy consumption \textit{increases} and
performance is degraded.
Other metrics that incorporate energy take the form of $Energy * Delay^n$.
@@ -34,7 +34,7 @@ We argue that while the energy-delay product can be used as a
\textit{constraint} to specify how much energy can be used to improve
performance.
%
-A effective constraint should be (1) relative to the applications inherent
+An effective constraint should be (1) relative to the application's inherent
energy needs and (2) independent of applications and devices.
%
Because it uses absolute energy, the energy-delay product meets neither of
@@ -57,7 +57,7 @@ inefficiency: $I = \frac{E}{E_{min}}$.
%
An \textit{inefficiency} of $1$ represents an application's most efficient
execution, while $1.5$ indicates that the application consumed $50\%$ more
-energy that its most efficient execution.
+energy than its most efficient execution.
%
Inefficiency is independent of workloads and devices and avoids the problems
inherent to absolute energy constraints.
@@ -143,7 +143,7 @@ We propose two methods for computing $E_{min}$:
\item \textbf{Predicting and learning:} The overhead of the $E_{min}$
computation can be further reduced by predicting $E_{min}$ based on previous observations
-  and learning continuously.
+  and by continuous learning.
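The inefficiency constraint defined in inefficiency.tex ($I = \frac{E}{E_{min}}$, with a budget bounding how far above $E_{min}$ an execution may go) can be sketched in a few lines. This is an illustrative sketch only: the (CPU, memory) frequency pairs and energy values below are invented numbers, not measurements from the paper.

```python
# Illustrative sketch of the inefficiency constraint: I = E / E_min.
# All frequency settings and energy values below are invented numbers.

def inefficiency(energy, e_min):
    """Return I = E / E_min; I = 1.0 is the most efficient execution."""
    if e_min <= 0:
        raise ValueError("E_min must be positive")
    return energy / e_min

# Hypothetical per-setting energy (joules) for one fixed unit of work,
# keyed by (CPU GHz, memory GHz).
measured = {(1.0, 0.8): 12.0, (1.5, 0.8): 10.0, (2.0, 1.6): 13.5}

e_min = min(measured.values())  # E_min: minimum over all frequency settings
budget = 1.3                    # inefficiency budget: up to 30% extra energy

# Settings whose inefficiency stays within the budget; an energy-management
# algorithm would pick the best-performing member of this set.
feasible = {s: e for s, e in measured.items()
            if inefficiency(e, e_min) <= budget}
# feasible -> {(1.0, 0.8): 12.0, (1.5, 0.8): 10.0}
```

Because the metric is a ratio against the workload's own $E_{min}$, the same budget (e.g. 1.3) is meaningful across applications and devices, which is the property the text argues absolute-energy budgets lack.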
% A variety of learning based approaches~\cite{li2009machine} have been proposed in the past to estimate various metrics and application phases diff --git a/inefficiency_speedup.tex b/inefficiency_speedup.tex index 864cfe6..7aa8e0b 100644 --- a/inefficiency_speedup.tex +++ b/inefficiency_speedup.tex @@ -7,7 +7,7 @@ the past %researchers have used it to make power performance trade-offs. To the best of our knowledge, prior work has not studied the system level energy-performance trade-offs of combined -CPU and memory DVFS. +CPU and memory frequency scaling. %considering the interaction between CPU and memory %frequency scaling. We take a first step and explore these trade-offs and show that incorrect @@ -71,8 +71,8 @@ inefficiency budget as needed c) and deliver the best performance. %\end{enumerate} Consequently, like other constraints used by algorithms such as performance, power and absolute energy, inefficiency -also allows energy management algorithms to waste system energy. We suggest -that, even though inefficiency doesn't completely eliminate the problem of +also allows energy management algorithms to waste system energy. We argue +that, although inefficiency doesn't completely eliminate the problem of wasting energy, it mitigates the problem. For example, rate limiting approaches waste energy as energy budget is specified for a given amount of time interval and doesn't require a specific amount of work to be done within that budget. diff --git a/introduction.tex b/introduction.tex index b3cce6f..6d91c9c 100644 --- a/introduction.tex +++ b/introduction.tex @@ -30,14 +30,14 @@ To better understand these systems, we characterize how the most performant CPU and memory frequency settings change for multiple workloads under various energy constraints. -Our work represents two advances over previous efforts. +Our work presents two advances over previous efforts. 
% First, while previous works have explored energy minimizations using DVFS under performance constraints focusing on reducing slack, we are the first to study the potential DVFS settings under an energy constraint. % Specifying performance constraints for servers is appropriate, since they are -both wall-powered and have terms of service that must be met. +both wall-powered and have quality of service constraints that must be met. % Therefore, they do not have to and cannot afford to sacrifice too much performance. @@ -53,7 +53,7 @@ energy constraints and it is both application and device independent---unlike existing metrics. Second, we are the first to characterize optimal frequency settings for -systems providing both CPU and memory DVFS. +systems providing CPU DVFS and memory DFS. % We find that closely tracking the optimal settings during execution produces many transitions and large frequency transition overhead. @@ -65,7 +65,7 @@ We characterize the relationship between the amount of performance loss and the rate of tuning for several benchmarks, and introduce the concepts of \textit{performance clusters} and \textit{stable regions} to aid the process. -We make following four contributions: +We make the following contributions: % \begin{enumerate} % @@ -74,7 +74,7 @@ system to express the amount of extra energy that can be used to improve performance. % \item We study the energy-performance trade-offs of systems that are capable -of both CPU and memory DVFS for multiple applications. We show that poor +of CPU DVFS and memory DFS for multiple applications. We show that poor frequency selection can hurt both performance and energy consumption. % \item We characterize the optimal frequency settings for multiple @@ -87,7 +87,7 @@ management algorithms. 
% \end{enumerate} -We use the \texttt{gem5} simulator, the Android smartphone platform and Linux +We use the \texttt{Gem5} simulator, the Android smartphone platform and Linux kernel, and an empirical power model to (1) measure the inefficiency of several applications for a wide range of frequency settings, (2) compute performance clusters, and (3) study how performance clusters evolve. @@ -112,4 +112,4 @@ studies their characteristics. % Section~\ref{sec-algo-implications} presents implications of using performance clusters on energy-management algorithms, and -Section~\ref{sec-conclusions} concludes. +Section~\ref{sec-conclusions} summarizes and concludes the paper. diff --git a/optimal_performance.tex b/optimal_performance.tex index 725351d..ebb5e59 100644 --- a/optimal_performance.tex +++ b/optimal_performance.tex @@ -5,7 +5,7 @@ \centering \includegraphics[width=\columnwidth]{figures/plots/496/2d_best_point_variation_mulineff/gobmk_2d_stable_point_mulineff_cpi_mpki.pdf} \vspace{-0.5em} -\caption{\textbf{Optimal Performance Point for \text{Gobmk} Across Inefficiencies:} At +\caption{\textbf{Optimal Performance Point for \textit{gobmk} Across Inefficiencies:} At low inefficiency budgets, the optimal frequency settings follow CPI of the application, and select high memory frequencies for memory intensive phases. % with %high CPI. @@ -36,7 +36,7 @@ inefficiency budget is a function of workload.} \end{subfigure}% \vspace{0.5em} \caption{\textbf{Performance Clusters of \textit{milc.}} -\textit{Milc} is CPU intensive to a large extent with some memory intensive +\textit{milc} is CPU intensive to a large extent with some memory intensive phases. At higher thresholds, while CPU frequency is tightly bound, performance clusters cover a wide range of memory settings due to small performance difference across these frequencies. 
} @@ -61,7 +61,7 @@ simulation noise, the algorithm selects the settings with highest CPU (first) and then memory frequency as this setting is bound to have highest performance among the other possibilities. -Figure~\ref{gobmk-optimal} plots the optimal settings for $gobmk$ for all +Figure~\ref{gobmk-optimal} plots the optimal settings for \textit{gobmk} for all benchmark samples (each of length 10~M instructions) across multiple inefficiency constraints. At low inefficiencies, the optimal settings follow the trends in CPI (cycles per instruction) and MPKI (misses per thousand diff --git a/performance_clusters.tex b/performance_clusters.tex index ff09118..d6f74c2 100644 --- a/performance_clusters.tex +++ b/performance_clusters.tex @@ -4,7 +4,7 @@ \centering \includegraphics[width=\columnwidth]{./figures/plots/496/stable_line_plots/lbm_stable_lineplot_annotated_5.pdf} \vspace{-0.5em} -\caption{\textbf{Stable Regions and Transitions for \textit{Lbm} with +\caption{\textbf{Stable Regions and Transitions for \textit{lbm} with Threshold of 5\% and Inefficiency Budget of 1.3:} Solid lines represent the stable regions and vertical dashed lines mark the transitions made by \textit{lbm}.} diff --git a/system_methodology.tex b/system_methodology.tex index 6f16544..7a392f5 100644 --- a/system_methodology.tex +++ b/system_methodology.tex @@ -25,7 +25,7 @@ performance for energy savings. %voltage could result in data corruption since the memory array itself is %asynchronous. As no current hardware systems support memory frequency scaling, -we resort to Gem5~\cite{Binkert:gem5}, a cycle-accurate full system simulator +we resort to \texttt{Gem5}~\cite{Binkert:gem5}, a cycle-accurate full system simulator %as a platform to perform our studies. @@ -34,21 +34,21 @@ to perform our studies. 
\centering
\includegraphics[width=0.75\columnwidth]{./figures/plots/systemBlockDiagram.pdf}
\caption{\textbf{System Block Diagram}: Blocks that are newly added or
-  significantly modified from Gem5 origin implementation are shaded.}
+  significantly modified from \texttt{Gem5}'s original implementation are shaded.}
\label{fig-system-block-diag}
\end{figure}
%We envision a system that consists of a CPU capable of tuning its voltage and
%frequency and memory that supports frequency scaling.
-Current Gem5 versions provide the infrastructure necessary to change CPU
-frequency and voltage; we extended Gem5 DVFS to incorporate memory frequency
-scaling. As shown in Figure~\ref{fig-system-block-diag}, Gem5 provides a DVFS
+Current \texttt{Gem5} versions provide the infrastructure necessary to change CPU
+frequency and voltage; we extended \texttt{Gem5} DVFS to incorporate memory frequency
+scaling. As shown in Figure~\ref{fig-system-block-diag}, \texttt{Gem5} includes a DVFS
controller device that provides an interface to control frequency by the OS at
runtime. We developed a memory frequency governor similar to existing Linux CPU
frequency governors. Timing and current parameters of DRAM are scaled with its
frequency as described in the technical note from Micron~\cite{micronpower-TN-url}.
%that are capable of tuning memory frequency at runtime.
-The blocks that we added or significantly modified from Gem5's original
+The blocks that we added or significantly modified from \texttt{Gem5}'s original
implementation are shaded in Figure~\ref{fig-system-block-diag}.

\begin{figure*}[t]
@@ -75,15 +75,15 @@ and degrade performance simultaneously.}
\subsection{Energy Models}
\label{subsec-energy-models}
-We developed energy models for the CPU and DRAM for our studies. Gem5 comes
+We developed energy models for the CPU and DRAM for our studies. \texttt{Gem5} comes
with the energy models for various DRAM chipsets.
The
-DRAMPower~\cite{drampower-tool} model is integrated into Gem5 and computes the
+DRAMPower~\cite{drampower-tool} model is integrated into \texttt{Gem5} and computes the
memory energy consumption periodically during the benchmark execution. However,
-Gem5 lacks a model for CPU energy consumption. We developed a processor power
+\texttt{Gem5} lacks a model for CPU energy consumption. We developed a processor power
model based on empirical measurements of a PandaBoard~\cite{pandaboard-url}
evaluation board. The board includes an OMAP4430~chipset with a Cortex~A9
processor; this chipset is used in the mobile platform we want to emulate, the
-Samsung Nexus S. We ran microbenchmarks designed to stress the PandaBoard to
+Samsung Galaxy Nexus. We ran microbenchmarks designed to stress the PandaBoard to
its full utilization and measured power consumed using an Agilent~34411A
multimeter. Because of the limitations of the platform, we could only measure
peak dynamic power. Therefore, to model different voltage levels we scaled it
@@ -97,7 +97,7 @@ processor is not computing, but unlike leakage power, background power scales
with clock frequency. We measure background power by calculating the
difference between the CPU power consumption in its power on idle state and
deep sleep mode (not clocked). Because background power is clocked, it is
-scaled in a similar manner to dynamic power.  Leakage power comprises up to
+scaled in a similar manner to dynamic power. Leakage power comprises up to
30\% of microprocessor peak power consumption~\cite{power7} and is linearly
proportional to supply voltage~\cite{leakage-islped02}.
@@ -109,8 +109,8 @@ proportional to supply voltage~\cite{leakage-islped02}.
\subsection{Experimental Methodology}
Our simulation infrastructure is based on Android~4.1.1 ``Jelly Bean'' run on
-the Gem5 full system simulator. 
We use default core configuration provided by
-Gem5 in revision 10585, that is designed to reflect ARM Cortex-A15 processor
+the \texttt{Gem5} full system simulator. We use the default core configuration provided by
+\texttt{Gem5} in revision 10585, which is designed to reflect the ARM Cortex-A15 processor
with L1 cache size of 64~KB with access latency of 2 core cycles and a unified
L2 cache of size 2~MB with hit latency of 12 core cycles. The CPU and caches
operate under the same clock domain. For our purposes, we have configured the
@@ -147,12 +147,12 @@ benchmarks that have interesting and unique phases.
%hours.
We collected samples of a fixed amount of work so that each sample would
-represent the same work even across different frequencies. In Gem5, we collected
+represent the same work even across different frequencies. In \texttt{Gem5}, we collected
performance and energy consumption data every 10~million user mode
instructions.
%this fixed sample of work makes .
%By collecting data for a fixed amount of work (instructions) we are able to study frequency scaling for workloads; the alternative sampling in time .
-Gem5 provides a mechanism to distinguish between user mode and
+\texttt{Gem5} provides a mechanism to distinguish between user mode and
kernel mode instructions. We used this feature to remove periodic OS traffic and enable a fair
comparison across simulations of different CPU and memory frequencies. We used
the collected performance and energy data to study the impact of workload dynamics on the
@@ -162,7 +162,7 @@ a given inefficiency budget. Note that, all our studies are performed using
performance or energy. The interplay of performance and energy consumption of
CPU and memory frequency scaling is complex as pointed by CoScale~\cite{deng2012coscale}.
In the next section, we measure and characterize
-the larger space of all system level performance and energy trade-offs
+the larger space of system-level performance and energy trade-offs
of various CPU and memory frequency settings.
%Although individual energy-performance trade-offs of DVFS for CPU and
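The CPU power-model scaling described in system_methodology.tex (peak dynamic and clocked background power scaled with voltage squared and frequency, leakage roughly linear in supply voltage) can be sketched as follows. The peak-power figures and voltage/frequency operating points are hypothetical placeholders, not the actual PandaBoard measurements.

```python
# Hedged sketch of the CPU power-model scaling: dynamic and clocked background
# power scale as V^2 * f from a measured peak; leakage power is roughly linear
# in supply voltage. All numeric operating points here are hypothetical.

def scale_dynamic(p_peak, v, f, v_peak, f_peak):
    """Scale peak dynamic (or clocked background) power: P ~ C * V^2 * f."""
    return p_peak * (v / v_peak) ** 2 * (f / f_peak)

def scale_leakage(p_leak_peak, v, v_peak):
    """Scale leakage power, approximately linear in supply voltage."""
    return p_leak_peak * (v / v_peak)

# Example: scale from a hypothetical peak point of 1.2 V @ 2.0 GHz
# down to a lower operating point of 0.9 V @ 1.0 GHz.
p_dyn = scale_dynamic(2.0, 0.9, 1.0, 1.2, 2.0)   # 2.0 W peak -> 0.5625 W
p_leak = scale_leakage(0.6, 0.9, 1.2)            # 0.6 W peak -> 0.45 W
p_total = p_dyn + p_leak                         # 1.0125 W at the lower point
```

The cubic-in-frequency behavior of dynamic power (since voltage typically tracks frequency) against the linear-in-voltage leakage term is what makes the CPU/memory frequency trade-off space non-obvious, motivating the characterization in the next section.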