From a160460b65da45fd7effab7c6cd7ed2e1c7085b1 Mon Sep 17 00:00:00 2001
From: Rizwana Begum
Date: Thu, 13 Aug 2015 10:08:29 -0400
Subject: [PATCH] Incorporated Mark's comments

---
 abstract.tex             |  8 ++++----
 acknowledgement.tex      |  5 +++++
 inefficiency.tex         | 24 ++++++++++++------------
 inefficiency_speedup.tex | 10 +++++-----
 optimal_performance.tex  |  4 ++--
 paper.tex                |  1 +
 performance_clusters.tex | 22 ++++++++++------------
 system_methodology.tex   | 45 +++++++++++++++++++++++----------------------
 8 files changed, 62 insertions(+), 57 deletions(-)
 create mode 100644 acknowledgement.tex

diff --git a/abstract.tex b/abstract.tex
index 7e6f584..a5d584c 100644
--- a/abstract.tex
+++ b/abstract.tex
@@ -1,10 +1,10 @@
 \begin{abstract}

 Battery lifetime continues to be a top complaint about smartphones. Dynamic
-voltage and frequency scaling (DVFS) has existed for mobile device CPUs for
-some time, and can be used to dynamically trade off energy for performance.
-To make more energy-performance tradeoffs possible, DVFS is beginning to be
-applied to memory as well.
+voltage and frequency scaling (DVFS) has existed for mobile device CPUs for some
+time, and provides a tradeoff between energy and performance. DVFS is beginning
+to be applied to memory as well to make more energy-performance tradeoffs
+possible.

 We present the first characterization of the behavior and optimal frequency
 settings of workloads running both under \textit{energy constraints} and on
diff --git a/acknowledgement.tex b/acknowledgement.tex
new file mode 100644
index 0000000..0baad77
--- /dev/null
+++ b/acknowledgement.tex
@@ -0,0 +1,5 @@
+\section{Acknowledgement}
+This material is based on work partially supported by NSF Collaborative Awards
+CSR-1409014 and CSR-1409367. Any opinions, findings, and conclusions or
+recommendations expressed in this material are those of the authors and do not
+necessarily reflect the views of the National Science Foundation.
diff --git a/inefficiency.tex b/inefficiency.tex
index 597e3ed..8476405 100644
--- a/inefficiency.tex
+++ b/inefficiency.tex
@@ -7,24 +7,24 @@ management algorithms for mobile systems should optimize performance under
 \textit{energy constraints}.
 %
 While several researchers have proposed algorithms that work under energy
-constraints~\cite{mobiheld09-cinder,ecosystem}, these approaches require that
-the constraints are expressed in terms of absolute energy.
+constraints, these approaches require that the constraints be expressed in
+terms of absolute energy~\cite{mobiheld09-cinder,ecosystem}.
 %
-For example, rate-limiting approaches~\cite{mobiheld09-cinder} take the
-maximum energy that can be consumed in a given time period as an input.
+For example, rate-limiting approaches take the maximum energy that can be
+consumed in a given time period as an input~\cite{mobiheld09-cinder}.
 %
 Once the application consumes its limit, it is paused until the next time
 period begins.
-Unfortunately, in practice it is difficult to choose absolute energy
+Unfortunately, in practice, it is difficult to choose absolute energy
 constraints appropriately for a diverse group of applications without
 understanding their inherent energy needs.
 %
 Energy consumption varies across applications, devices, and operating
 conditions, making it impractical to choose an absolute energy budget.
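As an aside, the rate-limiting scheme described above is a simple mechanism: grant each application a fixed energy budget per time period and pause it once the budget is spent. A minimal Python sketch follows; the class and its parameters are hypothetical illustrations of the mechanism, not the implementation of the cited systems.

    import time

    class EnergyRateLimiter:
        """Sketch of a rate limiter: an absolute energy budget per period."""

        def __init__(self, budget_joules, period_s):
            self.budget = budget_joules   # absolute energy budget per period
            self.period = period_s
            self.consumed = 0.0
            self.period_start = time.monotonic()

        def charge(self, joules):
            """Account energy used by the application; pause when over budget."""
            now = time.monotonic()
            if now - self.period_start >= self.period:
                # A new period begins: reset the meter.
                self.period_start = now
                self.consumed = 0.0
            self.consumed += joules
            if self.consumed >= self.budget:
                # Budget exhausted: pause until the next period begins.
                time.sleep(self.period - (now - self.period_start))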
 %
-Also, absolute energy constraints may slow down applications to the point
-that total energy consumption \textit{increases} at the same time that
+Also, applying absolute energy constraints may slow down applications to the
+point that total energy consumption \textit{increases} and
 performance is degraded.

 Other metrics that incorporate energy take the form of $Energy * Delay^n$.
@@ -56,7 +56,7 @@ energy the application could have consumed ($E_{min}$) on the same device as
 inefficiency: $I = \frac{E}{E_{min}}$.
 %
 An \textit{inefficiency} of $1$ represents an application's most efficient
-execution, while $1.5$ indicate the the application consumed $50\%$ more
+execution, while $1.5$ indicates that the application consumed $50\%$ more
 energy than its most efficient execution.
 %
 Inefficiency is independent of workloads and devices and avoids the problems
@@ -86,7 +86,7 @@ We continue by addressing these questions.
 %
 performance. Devices will operate between an inefficiency of 1 and $I_{max}$,
 which represents the unbounded energy constraint allowing the application to
-consume unbounded energy to deliver the best performance.
+consume as much energy as necessary to deliver the best performance.
 %
 $I_{max}$ depends upon applications and devices.
 %
@@ -165,7 +165,7 @@ of instructions.
 %We envision a system capable of scaling voltage and frequency of CPU and only
 %frequency of DRAM.
 Our models consider cross-component interactions on performance and energy.
-Performance model uses hardware performance counters to measure amount of time
+The performance model uses hardware performance counters to measure the amount of time
 each component is $Busy$ completing the work, $Idle$ stalled on the other
 component, and $Waiting$ for more work. We designed a systematic methodology to
 scale these states to estimate the execution time of a given workload at different
@@ -198,8 +198,8 @@ system~\cite{david2011memory,deng2012multiscale,deng2011memscale,diniz2007limiti
 %
 While most of the existing multi-component energy management approaches work
 under performance constraints, some have the potential to be modified to work
-under energy constraints and thus
-inefficiency~\cite{bitirgen2008coordinated,deng2012coscale,chen2011coordinating,fan2005synergy,felter2005performance,li2007cross,raghavendra2008no}.
+under energy constraints and thus could operate under an
+inefficiency budget~\cite{bitirgen2008coordinated,deng2012coscale,chen2011coordinating,fan2005synergy,felter2005performance,li2007cross,raghavendra2008no}.
 %
 We leave building some of these algorithms into a system as future work.
 %
diff --git a/inefficiency_speedup.tex b/inefficiency_speedup.tex
index 482d28a..5182d36 100644
--- a/inefficiency_speedup.tex
+++ b/inefficiency_speedup.tex
@@ -5,7 +5,7 @@
 Scaling individual components---CPU and memory---using DVFS has been studied
 in the past
 %and
 %researchers have used it
-to make power performance trade-offs. To the best of our knowledge, the prior
+to make power-performance trade-offs. To the best of our knowledge, prior
 work has not studied the system-level energy-performance trade-offs of
 combined CPU and memory DVFS.
 %considering the interaction between CPU and memory
@@ -43,8 +43,8 @@ We make three major observations:
 \noindent \textit{Running slower doesn't mean that the system is running
 efficiently.} At the lowest frequencies, 100MHz and 200MHz for CPU and memory,
 respectively, \textit{gobmk} takes the longest to execute.
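The inefficiency metric defined above, $I = \frac{E}{E_{min}}$, reduces to a one-line computation. As a minimal sketch (hypothetical helper names, placeholder joule values), this is how an energy manager might evaluate a measured execution against an inefficiency budget:

    def inefficiency(energy_j, e_min_j):
        # I = E / E_min: E_min is the least energy the application could
        # have consumed on the same device.
        return energy_j / e_min_j

    # An inefficiency of 1.55 means 55% more energy than the most
    # efficient execution (15.5 and 10.0 are placeholders, not data).
    assert abs(inefficiency(15.5, 10.0) - 1.55) < 1e-9

    def within_budget(energy_j, e_min_j, budget=1.3):
        # A budget of 1.3 permits at most 30% energy overhead.
        return inefficiency(energy_j, e_min_j) <= budget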
 These settings slow down the application so much
-that its overall energy consumption increases, thereby resulting in 1.55
-inefficiency for \textit{gobmk}. Algorithms that choose these frequency settings spend
+that its overall energy consumption increases, thereby resulting in an
+inefficiency of 1.55 for \textit{gobmk}. Algorithms that choose these frequency settings spend
 55\% more energy without any performance improvement.
 %The converse is also true
 %as noted by our second observation.
@@ -70,9 +70,9 @@
 a) use no more than the given inefficiency budget, b) use only as much
 inefficiency budget as needed, and c) deliver the best performance.
 %\end{enumerate}
-Consequently, like other constraints used by algorithms such as performance, power and absolute energy, $inefficiency$
+Consequently, like other constraints used by algorithms such as performance, power, and absolute energy, inefficiency
 also allows energy management algorithms to waste system energy. We suggest
-that, even though $inefficiency$ doesn't completely eliminate the problem of
+that, even though inefficiency doesn't completely eliminate the problem of
 wasting energy, it mitigates the problem. For example, rate-limiting
 approaches waste energy because the energy budget is specified for a given
 time interval and doesn't require a specific amount of work to be done within
 that budget.
diff --git a/optimal_performance.tex b/optimal_performance.tex
index ccb7819..93e86a9 100644
--- a/optimal_performance.tex
+++ b/optimal_performance.tex
@@ -58,7 +58,7 @@
 possible frequency settings under the given inefficiency budget. It then finds the
 CPU and memory frequency settings that result in the highest speedup. In cases
 where multiple settings result in similar speedup (within 0.5\%), to filter out
 simulation noise, the algorithm selects the settings with the highest CPU (first)
-and memory frequency as this setting is bound to have highest performance among
+and then memory frequency, as this setting is bound to have the highest performance among
 the other possibilities.

 Figure~\ref{gobmk-optimal} plots the optimal settings for \textit{gobmk} for all
@@ -103,7 +103,7 @@
 optimal settings for every sample may hinder some energy-performance trade-offs
 that could have been made if performance was not so tightly bounded (to only the
 highest performance). For example, \textit{bzip2} is CPU bound and therefore
 its performance at a memory frequency of 200MHz is within 3\% of performance at a
-memory frequency of 800MHz while CPU is running at 1000MHz. By sacrificing that
+memory frequency of 800MHz while the CPU is running at 1000MHz. By sacrificing that
 3\% of performance, the system could have consumed 1/4 the memory background
 energy, staying well under the given inefficiency budget.
 %\end{enumerate}
diff --git a/paper.tex b/paper.tex
index df7fb37..0885b06 100644
--- a/paper.tex
+++ b/paper.tex
@@ -81,6 +81,7 @@ Geoffrey Challen, Mark Hempstead}
 \input{algorithm_implications.tex}
 % 20 Apr 2015 : GWA : Add things here as needed.
 \input{conclusions.tex}
+\input{acknowledgement.tex}

 % 23 Sep 2014 : GWA : TODO : Reenable before submission.
diff --git a/performance_clusters.tex b/performance_clusters.tex
index 98918f3..21eb96a 100644
--- a/performance_clusters.tex
+++ b/performance_clusters.tex
@@ -122,8 +122,8 @@ Not all of the stable regions increase in length with increasing inefficiency budget.
 %inefficiency is a
 %function of workload characteristics.
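The optimal-settings search described in optimal_performance.tex above is straightforward to express. A minimal sketch follows, assuming each candidate setting's speedup and inefficiency have already been measured; the data layout and names are hypothetical:

    def pick_optimal(settings, budget):
        # settings: dicts with keys cpu_mhz, mem_mhz, speedup, inefficiency.
        # Keep only the settings that fit the inefficiency budget.
        feasible = [s for s in settings if s["inefficiency"] <= budget]
        # Treat speedups within 0.5% of the best as ties (simulation noise).
        best = max(s["speedup"] for s in feasible)
        near_best = [s for s in feasible if s["speedup"] >= 0.995 * best]
        # Break ties by highest CPU frequency first, then memory frequency.
        return max(near_best, key=lambda s: (s["cpu_mhz"], s["mem_mhz"]))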
 If consecutive
-samples of a workload have a small difference in performance but differ significantly in energy
-consumption then only at
+samples of a workload have a small difference in performance, but differ significantly in energy
+consumption, then only at
 higher inefficiency budgets will the system find common settings for these
 consecutive samples.
 % because all settings under an inefficiency budget are considered.
 %Note that we find the performance clusters by considering
@@ -144,7 +144,7 @@ Figure~\ref{clusters-milc} shows that \textit{milc} has similar trends as
 An interesting observation from the performance clusters is that algorithms like
 CoScale~\cite{deng2012coscale} that search for the best-performing settings every interval starting
-from the maximum frequency settings are not optimal. Algorithms can reduce the
+from the maximum frequency settings are not efficient. Algorithms can reduce the
 overhead of the optimal-settings search by starting the search from the settings
 selected for the previous interval, as application phases are often stable for
 multiple sample intervals.
 %as the application phases don't change drastically in
@@ -168,7 +168,7 @@
 settings between the current sample's performance cluster and the available
 settings until the previous sample. When the algorithm finds no more common
 samples, it marks the end of the stable region. If more than one frequency pair
 exists in the available settings for this region, the algorithm chooses the
-setting with highest CPU (first) and memory frequency as optimal settings for this
+setting with the highest CPU (first) and then memory frequency as the optimal settings for this
 region.
 Figure~\ref{lbm-stable-line-5-annotated} shows the CPU and memory frequency
 settings selected for stable regions of benchmark \textit{lbm}. It also has
 markers indicating the end of each stable region. In this figure, note that for
 every stable region (between any two markers) the frequency of both the CPU and
 memory remains constant.
 %Note that
-Our algorithm is not practical for real systems, it knows the characteristics of the
+Our algorithm is not practical for real systems, as it knows the characteristics of the
 future samples and their performance clusters at the beginning of a stable
 region.
 % (and therefore is impractical to implement in real systems).
 We are
-currently designing algorithms that are capable of tuning the system while
+currently designing algorithms in hardware and software that are capable of tuning the system while
 running the application, as future work.
 In Section~\ref{sec-algo-implications},
 we propose ways in which the length of stable regions and the available settings for
 a given region can be predicted for energy management algorithms in real systems.
@@ -260,7 +260,7 @@
 across benchmarks for multiple cluster thresholds at an inefficiency budget of
 1.3.

 \subsection{Energy-Performance Trade-offs}
 In this subsection we analyze the energy-performance trade-offs made by our
-ideal algorithm. We then add tuning cost of our algorithm and compare the
+ideal algorithm. We then add the tuning cost of our algorithm and compare the
 energy-performance trade-offs across multiple applications. We study multiple
 cluster thresholds and an inefficiency budget of 1.3.

@@ -300,9 +300,7 @@
 frequency transitions. We assume a tuning overhead of 500us and 30uJ, which
 includes computing inefficiencies, searching for the optimal setting, and
 transitioning the hardware to new settings~\cite{deng2012coscale}.
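The oracle stable-region construction described above -- intersect each new sample's performance cluster with the settings still available for the region, close the region when the intersection empties, and tie-break by highest CPU and then memory frequency -- can be sketched as follows; the data structures are hypothetical:

    def stable_regions(clusters):
        # clusters: one set of (cpu_mhz, mem_mhz) pairs per sample.
        regions, available, start = [], None, 0
        for i, cluster in enumerate(clusters):
            if available is None:
                available, start = set(cluster), i
                continue
            common = available & set(cluster)
            if common:
                available = common          # region continues
            else:
                # Region ends; tuple max picks highest CPU, then memory.
                regions.append((start, i - 1, max(available)))
                available, start = set(cluster), i
        if available is not None:
            regions.append((start, len(clusters) - 1, max(available)))
        return regions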
 We assume that a space of 100 settings is
-searched for every transition. \textit{gobmk} is the only benchmark that shows a
-performance improvement from the optimal settings when performance is allowed to
-degrade, which is unexpected. We are investigating its root cause.
+searched for every transition.
 %This is not intuitive and we are investigating the cause of this anomaly
 %\XXXnote{MH: be careful I would cut this s%entance at a minimum and then find
 %the reason for the change}.
@@ -317,8 +315,8 @@ samples. This results in longer stable regions.
 stable region. The longer the stable regions, the lower the number of
 transitions that the system needs to make.
 \item Allowing a higher degradation in performance may, in fact, result in improved
-performance when tuning overhead of algorithms is included due to reduction in
-number of frequency transitions in the system. Consequently energy savings also
+performance when tuning overhead is included, due to the reduction in the
+number of frequency transitions in the system; consequently, energy savings also
 increase.
 \end{enumerate}
diff --git a/system_methodology.tex b/system_methodology.tex
index cdcc8e6..074b109 100644
--- a/system_methodology.tex
+++ b/system_methodology.tex
@@ -12,7 +12,8 @@
 Recent research~\cite{david2011memory,deng2011memscale} has shown that DRAM
 frequency scaling also provides performance and energy trade-offs.
-In this work, we scale frequency and voltage for the CPU and scale only frequency for memory.
+In this work, we scale frequency and voltage for the CPU and scale only
+frequency for the memory~\cite{david2011memory,deng2011memscale}.
 %In this work, we scale frequency and voltage for the CPU and for the memory, scale frequency only.
 %to make energy-performance trade-offs.
 %Dynamic Voltage and
@@ -42,7 +43,7 @@
 to perform our studies. Current Gem5 versions provide the infrastructure
 necessary to change CPU frequency and voltage; we extended Gem5 DVFS to
 incorporate memory frequency scaling. As shown in
 Figure~\ref{fig-system-block-diag}, Gem5 provides a DVFS
-controller device that provides interface to control frequency by the OS at
+controller device that provides an interface for the OS to control frequency at
 runtime. We developed a memory frequency governor similar to existing Linux
 CPU frequency governors.
 %that are capable of tuning memory frequency at runtime.
@@ -76,14 +77,14 @@
 We developed energy models for the CPU and DRAM for our studies. Gem5 comes
 with energy models for various DRAM chipsets. The
 DRAMPower~\cite{drampower-tool} model is integrated into Gem5 and computes the
 memory energy consumption periodically during the benchmark execution. However,
-Gem5 lacks a model for CPU energy consumption. We developed a processor power
+Gem5 lacks a model for CPU energy consumption. We developed a processor power
 model based on empirical measurements of a PandaBoard~\cite{pandaboard-url}
 evaluation board. The board includes an OMAP4430~chipset with a Cortex~A9
 processor; this chipset is used in the mobile platform we want to emulate, the
 Samsung Nexus S. We ran microbenchmarks designed to stress the PandaBoard to
 its full utilization and measured power consumed using an Agilent~34411A
 multimeter. Because of the limitations of the platform, we could only measure
-peak dynamic power. Therefore to model different voltage levels we scaled it
+peak dynamic power. Therefore, to model different voltage levels, we scaled it
 quadratically with voltage and linearly with frequency $(P{\propto}V^{2}f)$.
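The power-model scaling just described amounts to evaluating $P \propto V^{2}f$ relative to the measured peak point. A minimal sketch follows; the 1.25V/1000MHz peak point matches the configuration described in the experimental methodology below, but the peak-power value is a placeholder, not the PandaBoard measurement:

    def dynamic_power(v, f_mhz, p_peak_w=0.6, v_peak=1.25, f_peak_mhz=1000.0):
        # Scale measured peak dynamic power quadratically with voltage
        # and linearly with frequency: P ∝ V^2 f.
        return p_peak_w * (v / v_peak) ** 2 * (f_mhz / f_peak_mhz)

    # Example: ~0.9V at 500MHz from a 0.6W placeholder peak -> ~0.156W.
    print(dynamic_power(0.9, 500.0))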
 Our peak dynamic power agrees with the numbers reported by previous
 work~\cite{poweragile-hotos11} and the datasheets.
@@ -105,22 +106,22 @@ proportional to supply voltage~\cite{leakage-islped02}.
 %with our CPU power model to compute CPU energy consumption of the application at run time.
 \subsection{Experimental Methodology}
-Our simulation infrastructure is based on Android~4.1.1 ``Jelly Bean'' run
-on the Gem5 full system simulator. We model a Cortex-A9 processor, single core,
-out-of-order CPU with an issue width of 8, L1 cache size of 64~KB with access
-latency of 2 core cycles and a unified L2 cache of size 2~MB with hit latency of
-12 core cycles. The CPU and caches operate under the same clock domain. For our
-purposes, we have configured the CPU clock domain frequency to have a range of
-100--1000~MHZ with highest voltage being 1.25V.
+Our simulation infrastructure is based on Android~4.1.1 ``Jelly Bean'' run on
+the Gem5 full system simulator. We use the default core configuration provided
+by Gem5 in revision 10585, which is designed to reflect an ARM Cortex-A15
+processor, with an L1 cache size of 64~KB with an access latency of 2 core
+cycles and a unified L2 cache of size 2~MB with a hit latency of 12 core
+cycles. The CPU and caches operate under the same clock domain. For our
+purposes, we have configured the CPU clock domain frequency to have a range of
+100--1000~MHz with the highest voltage being 1.25V.
 % MH: This might confuse readers
-%Our
-%experiments with a simple ring oscillator show that voltage changes by
-%0.02V/30MHz. The voltage and frequency pairs match with the frequency steps used
-%by the Nexus S.
+%Our experiments with a simple ring oscillator show that voltage changes by
+%0.02V/30MHz. The voltage and frequency pairs match with the frequency steps
+%used by the Nexus S.
 For the memory system, we simulated an LPDDR3 single-channel, one-rank memory
 accessed using an open-page policy.
 Timing and current parameters for LPDDR3 are configured as specified in
-Micron data sheet~\cite{micronspec-url}. Memory clock domain is configured with a
+data sheets from Micron~\cite{micronspec-url}. The memory clock domain is configured with a
 frequency range of 200MHz to 800MHz. As mentioned earlier, we did not scale
 memory voltage. The power supplies---VDD and VDD2---for LPDDR3 are fixed at
 1.8V and 1.2V, respectively.
@@ -137,10 +138,10 @@ benchmarks that have interesting and unique phases.
 %selected benchmarks that have interesting and unique phases with finer
 %frequency step granularity of 30MHz for CPU and 40MHz for memory, a total of
 %496 settings.
-Due to limited resources and time, running simulations for all benchmarks with
-finer frequency steps was difficult as it would have resulted in more than
-10,000 simulations, where each simulation would take anywhere between 4 to 12
-hours.
+%Due to limited resources and time, running simulations for all benchmarks with
+%finer frequency steps was difficult as it would have resulted in more than
+%10,000 simulations, where each simulation would take anywhere between 4 to 12
+%hours.

 We collected samples of a fixed amount of work so that each sample would
 represent the same work even across different frequencies. In Gem5, we collected
@@ -160,10 +161,10 @@ performance or energy.
 Although individual energy-performance trade-offs of DVFS for CPU and DFS
 for memory have been studied in the past, the trade-off resulting
 from the cross-component interaction of these two components has not been
CoScale~\cite{deng2012coscale} did point out that +characterized. CoScale~\cite{deng2012coscale} did point out that interplay of performance and energy consumption of these two components is complex and did present a heuristic that attempts to -pick the optimal point. However, it did not measure and characterize +pick the optimal point. In the next Section, we measure and characterize the larger space of all system level performance and energy trade-offs of various CPU and memory frequency settings. %In the next section, we study how performance and -- libgit2 0.22.2