Commit a160460b65da45fd7effab7c6cd7ed2e1c7085b1
1 parent 3b0c7aa8
Incorporated Mark's comments
Showing 8 changed files with 62 additions and 57 deletions
abstract.tex
| 1 | \begin{abstract} | 1 | \begin{abstract} |
| 2 | 2 | ||
| 3 | Battery lifetime continues to be a top complaint about smartphones. Dynamic | 3 | Battery lifetime continues to be a top complaint about smartphones. Dynamic |
| 4 | -voltage and frequency scaling (DVFS) has existed for mobile device CPUs for | ||
| 5 | -some time, and can be used to dynamically trade off energy for performance. | ||
| 6 | -To make more energy-performance tradeoffs possible, DVFS is beginning to be | ||
| 7 | -applied to memory as well. | 4 | +voltage and frequency scaling (DVFS) has existed for mobile device CPUs for some |
| 5 | +time, and provides a tradeoff between energy and performance. DVFS is beginning | ||
| 6 | +to be applied to memory as well to make more energy-performance tradeoffs | ||
| 7 | +possible. | ||
| 8 | 8 | ||
| 9 | We present the first characterization of the behavior and optimal frequency | 9 | We present the first characterization of the behavior and optimal frequency |
| 10 | settings of workloads running both under \textit{energy constraints} and on | 10 | settings of workloads running both under \textit{energy constraints} and on |
acknowledgement.tex
0 → 100644
| 1 | +\section{Acknowledgement} | ||
| 2 | +This material is based on work partially supported by NSF Collaborative Awards | ||
| 3 | +CSR-1409014 and CSR-1409367. Any opinion, findings, and conclusions or | ||
| 4 | +recommendations expressed in this material are those of the authors and do not | ||
| 5 | +necessarily reflect the views of the National Science Foundation. |
inefficiency.tex
| @@ -7,24 +7,24 @@ management algorithms for mobile systems should optimize performance under | @@ -7,24 +7,24 @@ management algorithms for mobile systems should optimize performance under | ||
| 7 | \textit{energy constraints}. | 7 | \textit{energy constraints}. |
| 8 | % | 8 | % |
| 9 | While several researchers have proposed algorithms that work under energy | 9 | While several researchers have proposed algorithms that work under energy |
| 10 | -constraints~\cite{mobiheld09-cinder,ecosystem}, these approaches require that | ||
| 11 | -the constraints are expressed in terms of absolute energy. | 10 | +constraints, these approaches require that the constraints are expressed in |
| 11 | +terms of absolute energy~\cite{mobiheld09-cinder,ecosystem}. | ||
| 12 | % | 12 | % |
| 13 | -For example, rate-limiting approaches~\cite{mobiheld09-cinder} take the | ||
| 14 | -maximum energy that can be consumed in a given time period as an input. | 13 | +For example, rate-limiting approaches take the maximum energy that can be |
| 14 | +consumed in a given time period as an input~\cite{mobiheld09-cinder}. | ||
| 15 | % | 15 | % |
| 16 | Once the application consumes its limit, it is paused until the next time | 16 | Once the application consumes its limit, it is paused until the next time |
| 17 | period begins. | 17 | period begins. |
| 18 | 18 | ||
| 19 | -Unfortunately, in practice it is difficult to choose absolute energy | 19 | +Unfortunately, in practice, it is difficult to choose absolute energy |
| 20 | constraints appropriately for a diverse group of applications without | 20 | constraints appropriately for a diverse group of applications without |
| 21 | understanding their inherent energy needs. | 21 | understanding their inherent energy needs. |
| 22 | % | 22 | % |
| 23 | Energy consumption varies across applications, devices, and operating | 23 | Energy consumption varies across applications, devices, and operating |
| 24 | conditions, making it impractical to choose an absolute energy budget. | 24 | conditions, making it impractical to choose an absolute energy budget. |
| 25 | % | 25 | % |
| 26 | -Also, absolute energy constraints may slow down applications to the point | ||
| 27 | -that total energy consumption \textit{increases} at the same time that | 26 | +Also, applying absolute energy constraints may slow down applications to the |
| 27 | +point that total energy consumption \textit{increases} and | ||
| 28 | performance is degraded. | 28 | performance is degraded. |
| 29 | 29 | ||
| 30 | Other metrics that incorporate energy take the form of $Energy * Delay^n$. | 30 | Other metrics that incorporate energy take the form of $Energy * Delay^n$. |
| @@ -56,7 +56,7 @@ energy the application could have consumed ($E_{min}$) on the same device as | @@ -56,7 +56,7 @@ energy the application could have consumed ($E_{min}$) on the same device as | ||
| 56 | inefficiency: $I = \frac{E}{E_{min}}$. | 56 | inefficiency: $I = \frac{E}{E_{min}}$. |
| 57 | % | 57 | % |
| 58 | An \textit{inefficiency} of $1$ represents an application's most efficient | 58 | An \textit{inefficiency} of $1$ represents an application's most efficient |
| 59 | -execution, while $1.5$ indicate the the application consumed $50\%$ more | 59 | +execution, while $1.5$ indicates that the application consumed $50\%$ more |
| 60 | energy that its most efficient execution. | 60 | energy that its most efficient execution. |
| 61 | % | 61 | % |
| 62 | Inefficiency is independent of workloads and devices and avoids the problems | 62 | Inefficiency is independent of workloads and devices and avoids the problems |
| @@ -86,7 +86,7 @@ We continue by addressing these questions. | @@ -86,7 +86,7 @@ We continue by addressing these questions. | ||
| 86 | % performance. | 86 | % performance. |
| 87 | Devices will operate between an inefficiency of 1 and $I_{max}$ which | 87 | Devices will operate between an inefficiency of 1 and $I_{max}$ which |
| 88 | represents the unbounded energy constraint allowing the application to | 88 | represents the unbounded energy constraint allowing the application to |
| 89 | -consume unbounded energy to deliver the best performance. | 89 | +consume as much energy as necessary to deliver the best performance. |
| 90 | % | 90 | % |
| 91 | $I_{max}$ depends upon applications and devices. | 91 | $I_{max}$ depends upon applications and devices. |
| 92 | % | 92 | % |
| @@ -165,7 +165,7 @@ of instructions. | @@ -165,7 +165,7 @@ of instructions. | ||
| 165 | %We envision a system capable of scaling voltage and frequency of CPU and only | 165 | %We envision a system capable of scaling voltage and frequency of CPU and only |
| 166 | %frequency of DRAM. | 166 | %frequency of DRAM. |
| 167 | Our models consider cross-component interactions on performance and energy. | 167 | Our models consider cross-component interactions on performance and energy. |
| 168 | -Performance model uses hardware performance counters to measure amount of time | 168 | +The performance model uses hardware performance counters to measure amount of time |
| 169 | each component is $Busy$ completing the work, $Idle$ stalled on the other | 169 | each component is $Busy$ completing the work, $Idle$ stalled on the other |
| 170 | component and $Waiting$ for more work. We designed systematic methodology to | 170 | component and $Waiting$ for more work. We designed systematic methodology to |
| 171 | scale these states to estimate execution time of a given workload at different | 171 | scale these states to estimate execution time of a given workload at different |
| @@ -198,8 +198,8 @@ system~\cite{david2011memory,deng2012multiscale,deng2011memscale,diniz2007limiti | @@ -198,8 +198,8 @@ system~\cite{david2011memory,deng2012multiscale,deng2011memscale,diniz2007limiti | ||
| 198 | % | 198 | % |
| 199 | While most of the existing multi-component energy management approaches work | 199 | While most of the existing multi-component energy management approaches work |
| 200 | under performance constraints, some have potential to be modified to work | 200 | under performance constraints, some have potential to be modified to work |
| 201 | -under energy constraints and thus | ||
| 202 | -inefficiency~\cite{bitirgen2008coordinated,deng2012coscale,chen2011coordinating,fan2005synergy,felter2005performance,li2007cross,raghavendra2008no}. | 201 | +under energy constraints and thus could operate under |
| 202 | +inefficiency budget~\cite{bitirgen2008coordinated,deng2012coscale,chen2011coordinating,fan2005synergy,felter2005performance,li2007cross,raghavendra2008no}. | ||
| 203 | % | 203 | % |
| 204 | We leave building some of these algorithms into a system as future work. | 204 | We leave building some of these algorithms into a system as future work. |
| 205 | % | 205 | % |
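Note on inefficiency.tex: the metric discussed in these hunks is $I = \frac{E}{E_{min}}$, where $E$ is the energy an application consumed and $E_{min}$ is the minimum it could have consumed on the same device. A minimal Python sketch of that ratio follows. The function names and the sample energy values are hypothetical and do not come from the paper's sources.

```python
def inefficiency(energy: float, min_energy: float) -> float:
    """Return I = E / E_min for one execution (both values in the same unit)."""
    if min_energy <= 0:
        raise ValueError("E_min must be positive")
    return energy / min_energy


def within_budget(energy: float, min_energy: float, budget: float) -> bool:
    """True if the execution stays at or below the given inefficiency budget."""
    return inefficiency(energy, min_energy) <= budget


# Hypothetical numbers: an execution that uses 50% more energy than its
# most efficient execution has an inefficiency of 1.5.
print(inefficiency(1.5, 1.0))
print(within_budget(1.5, 1.0, budget=1.3))
```

An inefficiency of 1 corresponds to the most efficient execution, so a budget of 1.3 in this sketch rejects the 1.5 execution.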
inefficiency_speedup.tex
| @@ -5,7 +5,7 @@ Scaling individual components---CPU and memory---using DVFS has been studied in | @@ -5,7 +5,7 @@ Scaling individual components---CPU and memory---using DVFS has been studied in | ||
| 5 | the past | 5 | the past |
| 6 | %and | 6 | %and |
| 7 | %researchers have used it | 7 | %researchers have used it |
| 8 | -to make power performance trade-offs. To the best of our knowledge, the prior | 8 | +to make power performance trade-offs. To the best of our knowledge, prior |
| 9 | work has not studied the system level energy-performance trade-offs of combined | 9 | work has not studied the system level energy-performance trade-offs of combined |
| 10 | CPU and memory DVFS. | 10 | CPU and memory DVFS. |
| 11 | %considering the interaction between CPU and memory | 11 | %considering the interaction between CPU and memory |
| @@ -43,8 +43,8 @@ We make three major observations: | @@ -43,8 +43,8 @@ We make three major observations: | ||
| 43 | \noindent \textit{Running slower doesn't mean that system is running | 43 | \noindent \textit{Running slower doesn't mean that system is running |
| 44 | efficiently.} At the lowest frequencies, 100MHz and 200MHz for CPU and | 44 | efficiently.} At the lowest frequencies, 100MHz and 200MHz for CPU and |
| 45 | memory respectively, \textit{gobmk} takes the longest to execute. These settings slow down the application so much | 45 | memory respectively, \textit{gobmk} takes the longest to execute. These settings slow down the application so much |
| 46 | -that its overall energy consumption increases, thereby resulting in 1.55 | ||
| 47 | -inefficiency for \textit{gobmk}. Algorithms that choose these frequency settings spend | 46 | +that its overall energy consumption increases, thereby resulting in |
| 47 | +inefficiency of 1.55 for \textit{gobmk}. Algorithms that choose these frequency settings spend | ||
| 48 | 55\% more energy without any performance improvement. | 48 | 55\% more energy without any performance improvement. |
| 49 | %The converse is also true | 49 | %The converse is also true |
| 50 | %as noted by our second observation. | 50 | %as noted by our second observation. |
| @@ -70,9 +70,9 @@ a) use no more than given inefficiency budget b) should use only as much | @@ -70,9 +70,9 @@ a) use no more than given inefficiency budget b) should use only as much | ||
| 70 | inefficiency budget as needed c) and deliver the best performance. | 70 | inefficiency budget as needed c) and deliver the best performance. |
| 71 | %\end{enumerate} | 71 | %\end{enumerate} |
| 72 | 72 | ||
| 73 | -Consequently, like other constraints used by algorithms such as performance, power and absolute energy, $inefficiency$ | 73 | +Consequently, like other constraints used by algorithms such as performance, power and absolute energy, inefficiency |
| 74 | also allows energy management algorithms to waste system energy. We suggest | 74 | also allows energy management algorithms to waste system energy. We suggest |
| 75 | -that, even though $inefficiency$ doesn't completely eliminate the problem of | 75 | +that, even though inefficiency doesn't completely eliminate the problem of |
| 76 | wasting energy, it mitigates the problem. For example, rate limiting approaches | 76 | wasting energy, it mitigates the problem. For example, rate limiting approaches |
| 77 | waste energy as energy budget is specified for a given amount of time interval | 77 | waste energy as energy budget is specified for a given amount of time interval |
| 78 | and doesn't require a specific amount of work to be done within that budget. | 78 | and doesn't require a specific amount of work to be done within that budget. |
optimal_performance.tex
| @@ -58,7 +58,7 @@ possible frequency settings under given inefficiency budget. It then finds the | @@ -58,7 +58,7 @@ possible frequency settings under given inefficiency budget. It then finds the | ||
| 58 | CPU and memory frequency settings that result in highest speedup. In cases | 58 | CPU and memory frequency settings that result in highest speedup. In cases |
| 59 | where multiple settings result in similar speedup (within 0.5\%), to filter out | 59 | where multiple settings result in similar speedup (within 0.5\%), to filter out |
| 60 | simulation noise, the algorithm selects the settings with highest CPU (first) | 60 | simulation noise, the algorithm selects the settings with highest CPU (first) |
| 61 | -and memory frequency as this setting is bound to have highest performance among | 61 | +and then memory frequency as this setting is bound to have highest performance among |
| 62 | the other possibilities. | 62 | the other possibilities. |
| 63 | 63 | ||
| 64 | Figure~\ref{gobmk-optimal} plots the optimal settings for $gobmk$ for all | 64 | Figure~\ref{gobmk-optimal} plots the optimal settings for $gobmk$ for all |
| @@ -103,7 +103,7 @@ optimal settings for every sample may hinder some energy-performance trade-off | @@ -103,7 +103,7 @@ optimal settings for every sample may hinder some energy-performance trade-off | ||
| 103 | that could have been made if performance was not so tightly bounded (to only | 103 | that could have been made if performance was not so tightly bounded (to only |
| 104 | highest performance). For example, \textit{bzip2} is CPU bound and therefore | 104 | highest performance). For example, \textit{bzip2} is CPU bound and therefore |
| 105 | its performance at memory frequency of 200MHz is within 3\% of performance at a | 105 | its performance at memory frequency of 200MHz is within 3\% of performance at a |
| 106 | -memory frequency of 800MHz while CPU is running at 1000MHz. By sacrificing that | 106 | +memory frequency of 800MHz while the CPU is running at 1000MHz. By sacrificing that |
| 107 | 3\% of performance, the system could have consumed 1/4 the memory background | 107 | 3\% of performance, the system could have consumed 1/4 the memory background |
| 108 | energy staying well under the given inefficiency budget. | 108 | energy staying well under the given inefficiency budget. |
| 109 | %\end{enumerate} | 109 | %\end{enumerate} |
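Note on optimal_performance.tex: the selection rule touched by the first hunk (keep the settings under the inefficiency budget, take the highest speedup, and break near-ties by highest CPU frequency first and then memory frequency) can be sketched roughly as below. This is an illustration, not the authors' implementation. The `Setting` type, the function name, and the sample values are hypothetical, and reading "within 0.5\%" as relative to the best speedup is an assumption.

```python
from typing import NamedTuple, Sequence


class Setting(NamedTuple):
    cpu_mhz: int
    mem_mhz: int
    speedup: float
    inefficiency: float


def pick_optimal(settings: Sequence[Setting], budget: float,
                 tolerance: float = 0.005) -> Setting:
    """Pick the best-performing setting under an inefficiency budget."""
    feasible = [s for s in settings if s.inefficiency <= budget]
    if not feasible:
        raise ValueError("no setting fits the inefficiency budget")
    best = max(s.speedup for s in feasible)
    # Assumption: "similar speedup" means within `tolerance` of the best.
    near_best = [s for s in feasible if s.speedup >= best * (1 - tolerance)]
    # Tie-break: highest CPU frequency first, then highest memory frequency.
    return max(near_best, key=lambda s: (s.cpu_mhz, s.mem_mhz))


# Hypothetical candidate settings.
candidates = [
    Setting(1000, 200, 1.995, 1.20),
    Setting(900, 800, 2.000, 1.25),
    Setting(1000, 800, 2.100, 1.40),  # exceeds a budget of 1.3
    Setting(500, 400, 1.500, 1.05),
]
print(pick_optimal(candidates, budget=1.3))
```

With a budget of 1.3, the two settings near speedup 2.0 count as a tie, and the one with the higher CPU frequency is chosen even though its memory frequency is lower.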
paper.tex
| @@ -81,6 +81,7 @@ Geoffrey Challen, Mark Hempstead} | @@ -81,6 +81,7 @@ Geoffrey Challen, Mark Hempstead} | ||
| 81 | \input{algorithm_implications.tex} | 81 | \input{algorithm_implications.tex} |
| 82 | % 20 Apr 2015 : GWA : Add things here as needed. | 82 | % 20 Apr 2015 : GWA : Add things here as needed. |
| 83 | \input{conclusions.tex} | 83 | \input{conclusions.tex} |
| 84 | +\input{acknowledgement.tex} | ||
| 84 | 85 | ||
| 85 | % 23 Sep 2014 : GWA : TODO : Reenable before submission. | 86 | % 23 Sep 2014 : GWA : TODO : Reenable before submission. |
| 86 | 87 |
performance_clusters.tex
| @@ -122,8 +122,8 @@ Not all of the stable regions increase in length with increasing inefficiency bu | @@ -122,8 +122,8 @@ Not all of the stable regions increase in length with increasing inefficiency bu | ||
| 122 | %inefficiency is a | 122 | %inefficiency is a |
| 123 | %function of workload characteristics. | 123 | %function of workload characteristics. |
| 124 | If consecutive | 124 | If consecutive |
| 125 | -samples of a workload have a small difference in performance but differ significantly in energy | ||
| 126 | -consumption then only at | 125 | +samples of a workload have a small difference in performance, but differ significantly in energy |
| 126 | +consumption, then only at | ||
| 127 | higher inefficiency budgets will the system find common settings for these | 127 | higher inefficiency budgets will the system find common settings for these |
| 128 | consecutive samples. % because all settings under an inefficiency budget are considered. | 128 | consecutive samples. % because all settings under an inefficiency budget are considered. |
| 129 | %Note that we find the performance clusters by considering | 129 | %Note that we find the performance clusters by considering |
| @@ -144,7 +144,7 @@ Figure~\ref{clusters-milc} shows that \textit{milc} has similar trends as | @@ -144,7 +144,7 @@ Figure~\ref{clusters-milc} shows that \textit{milc} has similar trends as | ||
| 144 | 144 | ||
| 145 | An interesting observation from the performance clusters is that algorithms | 145 | An interesting observation from the performance clusters is that algorithms |
| 146 | like CoScale~\cite{deng2012coscale} that search for the best performing settings every interval starting | 146 | like CoScale~\cite{deng2012coscale} that search for the best performing settings every interval starting |
| 147 | -from the maximum frequency settings are not optimal. Algorithms can reduce the | 147 | +from the maximum frequency settings are not efficient. Algorithms can reduce the |
| 148 | overhead of optimal settings search by starting search from the settings selected | 148 | overhead of optimal settings search by starting search from the settings selected |
| 149 | for the previous interval as application phases are often stable for multiple sample intervals. | 149 | for the previous interval as application phases are often stable for multiple sample intervals. |
| 150 | %as the application phases don't change drastically in | 150 | %as the application phases don't change drastically in |
| @@ -168,7 +168,7 @@ settings between the current sample performance cluster and the available | @@ -168,7 +168,7 @@ settings between the current sample performance cluster and the available | ||
| 168 | settings until the previous sample. When the algorithm finds no more common | 168 | settings until the previous sample. When the algorithm finds no more common |
| 169 | samples, it marks the end of the stable region. If more than one frequency pair | 169 | samples, it marks the end of the stable region. If more than one frequency pair |
| 170 | exists in the available settings for this region, the algorithm chooses the | 170 | exists in the available settings for this region, the algorithm chooses the |
| 171 | -setting with highest CPU (first) and memory frequency as optimal settings for this | 171 | +setting with highest CPU (first) and then memory frequency as optimal settings for this |
| 172 | region. Figure~\ref{lbm-stable-line-5-annotated} shows the CPU and memory frequency | 172 | region. Figure~\ref{lbm-stable-line-5-annotated} shows the CPU and memory frequency |
| 173 | settings selected for stable regions of benchmark \textit{lbm}. It also has | 173 | settings selected for stable regions of benchmark \textit{lbm}. It also has |
| 174 | markers indicating the end of each stable region. In this figure, note that for | 174 | markers indicating the end of each stable region. In this figure, note that for |
| @@ -176,11 +176,11 @@ every stable region (between any two markers) the frequency of both CPU and memo | @@ -176,11 +176,11 @@ every stable region (between any two markers) the frequency of both CPU and memo | ||
| 176 | constant. | 176 | constant. |
| 177 | 177 | ||
| 178 | %Note that | 178 | %Note that |
| 179 | -Our algorithm is not practical for real systems, it knows the characteristics of the | 179 | +Our algorithm is not practical for real systems, as it knows the characteristics of the |
| 180 | future samples and their performance clusters in the beginning of a stable | 180 | future samples and their performance clusters in the beginning of a stable |
| 181 | region. % (and therefore is impractical to implement in real systems). | 181 | region. % (and therefore is impractical to implement in real systems). |
| 182 | We are | 182 | We are |
| 183 | -currently designing algorithms that are capable of tuning the system while | 183 | +currently designing algorithms in hardware and software that are capable of tuning the system while |
| 184 | running the application as future work. In Section~\ref{sec-algo-implications}, we | 184 | running the application as future work. In Section~\ref{sec-algo-implications}, we |
| 185 | propose ways in which length of stable regions and the available settings for a | 185 | propose ways in which length of stable regions and the available settings for a |
| 186 | given region can be predicted for energy management algorithms in real systems. | 186 | given region can be predicted for energy management algorithms in real systems. |
| @@ -260,7 +260,7 @@ across benchmarks for multiple cluster thresholds at inefficiency budget of 1.3. | @@ -260,7 +260,7 @@ across benchmarks for multiple cluster thresholds at inefficiency budget of 1.3. | ||
| 260 | 260 | ||
| 261 | \subsection{Energy-Performance Trade-offs} | 261 | \subsection{Energy-Performance Trade-offs} |
| 262 | In this subsection we analyze the energy-performance trade-offs made by our | 262 | In this subsection we analyze the energy-performance trade-offs made by our |
| 263 | -ideal algorithm. We then add tuning cost of our algorithm and compare the | 263 | +ideal algorithm. We then add the tuning cost of our algorithm and compare the |
| 264 | energy performance trade-offs across multiple applications. We study multiple | 264 | energy performance trade-offs across multiple applications. We study multiple |
| 265 | cluster thresholds and an inefficiency budget of 1.3. | 265 | cluster thresholds and an inefficiency budget of 1.3. |
| 266 | 266 | ||
| @@ -300,9 +300,7 @@ frequency transitions. We assume tuning overhead | @@ -300,9 +300,7 @@ frequency transitions. We assume tuning overhead | ||
| 300 | of 500us and 30uJ, which includes computing inefficiencies, searching for the | 300 | of 500us and 30uJ, which includes computing inefficiencies, searching for the |
| 301 | optimal setting and transition the hardware to new | 301 | optimal setting and transition the hardware to new |
| 302 | settings~\cite{deng2012coscale}. We assumed that a space of 100 settings is | 302 | settings~\cite{deng2012coscale}. We assumed that a space of 100 settings is |
| 303 | -searched for every transition. \textit{gobmk} is the only benchmark that shows a | ||
| 304 | -performance improvement from the optimal settings when performance is allowed to | ||
| 305 | -degrade, which is unexpected. We are investigating its root cause. | 303 | +searched for every transition. |
| 306 | %This is not intuitive and we are investigating the cause of this anomaly | 304 | %This is not intuitive and we are investigating the cause of this anomaly |
| 307 | %\XXXnote{MH: be careful I would cut this s%entance at a minimum and then find | 305 | %\XXXnote{MH: be careful I would cut this s%entance at a minimum and then find |
| 308 | %the reason for the change}. | 306 | %the reason for the change}. |
| @@ -317,8 +315,8 @@ samples. This results in longer stable regions. | @@ -317,8 +315,8 @@ samples. This results in longer stable regions. | ||
| 317 | stable region. The longer the stable regions, the lower | 315 | stable region. The longer the stable regions, the lower |
| 318 | the number of transitions that the system need to make. | 316 | the number of transitions that the system need to make. |
| 319 | \item Allowing a higher degradation in performance may, in fact, result in improved | 317 | \item Allowing a higher degradation in performance may, in fact, result in improved |
| 320 | -performance when tuning overhead of algorithms is included due to reduction in | ||
| 321 | -number of frequency transitions in the system. Consequently energy savings also | 318 | +performance when tuning overhead is included due to reduction in |
| 319 | +number of frequency transitions in the system, consequently energy savings also | ||
| 322 | increase. | 320 | increase. |
| 323 | \end{enumerate} | 321 | \end{enumerate} |
| 324 | 322 |
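Note on performance_clusters.tex: the stable-region search described in the hunk near line 168 intersects each sample's performance cluster with the settings still available from earlier samples, and it closes the region when nothing is left in common. A rough Python sketch under stated assumptions follows. Representing each cluster as a set of `(cpu_mhz, mem_mhz)` pairs, the function name, and the sample data are all hypothetical. Like the algorithm in the text, the sketch assumes the clusters of future samples are already known, so it is an offline illustration only.

```python
from typing import List, Sequence, Set, Tuple

FreqPair = Tuple[int, int]  # (cpu_mhz, mem_mhz)


def stable_regions(
    clusters: Sequence[Set[FreqPair]],
) -> List[Tuple[int, int, FreqPair]]:
    """Return (start, end_exclusive, chosen_setting) for each stable region."""
    regions: List[Tuple[int, int, FreqPair]] = []
    start = 0
    available: Set[FreqPair] = set()
    for i, cluster in enumerate(clusters):
        if not cluster:
            raise ValueError(f"sample {i} has an empty performance cluster")
        common = available & cluster
        if i > 0 and common:
            available = common  # region continues with fewer options
            continue
        if i > 0:
            # No common setting left: close the region. max() on tuples
            # prefers the highest CPU frequency, then memory frequency.
            regions.append((start, i, max(available)))
        start, available = i, set(cluster)
    if clusters:
        regions.append((start, len(clusters), max(available)))
    return regions


# Hypothetical clusters for four consecutive samples.
samples = [
    {(1000, 800), (900, 800)},
    {(900, 800), (900, 400)},
    {(500, 200)},
    {(500, 200), (600, 200)},
]
print(stable_regions(samples))
```

Here the first two samples share only `(900, 800)`, so they form one region, and the third sample starts a new one.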
system_methodology.tex
| @@ -12,7 +12,8 @@ Recent | @@ -12,7 +12,8 @@ Recent | ||
| 12 | research~\cite{david2011memory,deng2011memscale} has shown that DRAM frequency scaling | 12 | research~\cite{david2011memory,deng2011memscale} has shown that DRAM frequency scaling |
| 13 | also provides performance and energy trade-offs. | 13 | also provides performance and energy trade-offs. |
| 14 | 14 | ||
| 15 | -In this work, we scale frequency and voltage for the CPU and scale only frequency for memory. | 15 | +In this work, we scale frequency and voltage for the CPU and scale only |
| 16 | +frequency for the memory~\cite{david2011memory,deng2011memscale}. | ||
| 16 | %In this work, we scale frequency and voltage for the CPU and for the memory, scale frequency only. | 17 | %In this work, we scale frequency and voltage for the CPU and for the memory, scale frequency only. |
| 17 | %to make energy-performance trade-offs. | 18 | %to make energy-performance trade-offs. |
| 18 | %Dynamic Voltage and | 19 | %Dynamic Voltage and |
| @@ -42,7 +43,7 @@ to perform our studies. | @@ -42,7 +43,7 @@ to perform our studies. | ||
| 42 | Current Gem5 versions provide the infrastructure necessary to change CPU | 43 | Current Gem5 versions provide the infrastructure necessary to change CPU |
| 43 | frequency and voltage; we extended Gem5 DVFS to incorporate memory frequency | 44 | frequency and voltage; we extended Gem5 DVFS to incorporate memory frequency |
| 44 | scaling. As shown in Figure~\ref{fig-system-block-diag}, Gem5 provides a DVFS | 45 | scaling. As shown in Figure~\ref{fig-system-block-diag}, Gem5 provides a DVFS |
| 45 | -controller device that provides interface to control frequency by the OS at | 46 | +controller device that provides an interface to control frequency by the OS at |
| 46 | runtime. We developed a memory frequency governor similar to existing Linux CPU | 47 | runtime. We developed a memory frequency governor similar to existing Linux CPU |
| 47 | frequency governors. | 48 | frequency governors. |
| 48 | %that are capable of tuning memory frequency at runtime. | 49 | %that are capable of tuning memory frequency at runtime. |
| @@ -76,14 +77,14 @@ We developed energy models for the CPU and DRAM for our studies. Gem5 comes | @@ -76,14 +77,14 @@ We developed energy models for the CPU and DRAM for our studies. Gem5 comes | ||
| 76 | with the energy models for various DRAM chipsets. The | 77 | with the energy models for various DRAM chipsets. The |
| 77 | DRAMPower~\cite{drampower-tool} model is integrated into Gem5 and computes the | 78 | DRAMPower~\cite{drampower-tool} model is integrated into Gem5 and computes the |
| 78 | memory energy consumption periodically during the benchmark execution. However, | 79 | memory energy consumption periodically during the benchmark execution. However, |
| 79 | -Gem5 lacks a model for CPU energy consumption. We developed a processor power | 80 | +Gem5 lacks a model for CPU energy consumption. We developed a processor power |
| 80 | model based on empirical measurements of a PandaBoard~\cite{pandaboard-url} | 81 | model based on empirical measurements of a PandaBoard~\cite{pandaboard-url} |
| 81 | evaluation board. The board includes a OMAP4430~chipset with a Cortex~A9 | 82 | evaluation board. The board includes a OMAP4430~chipset with a Cortex~A9 |
| 82 | processor; this chipset is used in the mobile platform we want to emulate, the | 83 | processor; this chipset is used in the mobile platform we want to emulate, the |
| 83 | Samsung Nexus S. We ran microbenchmarks designed to stress the PandaBoard to | 84 | Samsung Nexus S. We ran microbenchmarks designed to stress the PandaBoard to |
| 84 | its full utilization and measured power consumed using an Agilent~34411A | 85 | its full utilization and measured power consumed using an Agilent~34411A |
| 85 | multimeter. Because of the limitations of the platform, we could only measure | 86 | multimeter. Because of the limitations of the platform, we could only measure |
| 86 | -peak dynamic power. Therefore to model different voltage levels we scaled it | 87 | +peak dynamic power. Therefore, to model different voltage levels we scaled it |
| 87 | quadratically with voltage and linear with frequency $(P{\propto}V^{2}f)$. Our | 88 | quadratically with voltage and linear with frequency $(P{\propto}V^{2}f)$. Our |
| 88 | peak dynamic power agrees with the numbers reported by previous | 89 | peak dynamic power agrees with the numbers reported by previous |
| 89 | work~\cite{poweragile-hotos11} and the datasheets. | 90 | work~\cite{poweragile-hotos11} and the datasheets. |
| @@ -105,22 +106,22 @@ proportional to supply voltage~\cite{leakage-islped02}. | @@ -105,22 +106,22 @@ proportional to supply voltage~\cite{leakage-islped02}. | ||
| 105 | %with our CPU power model to compute CPU energy consumption of the application at run time. | 106 | %with our CPU power model to compute CPU energy consumption of the application at run time. |
| 106 | 107 | ||
| 107 | \subsection{Experimental Methodology} | 108 | \subsection{Experimental Methodology} |
| 108 | -Our simulation infrastructure is based on Android~4.1.1 ``Jelly Bean'' run | ||
| 109 | -on the Gem5 full system simulator. We model a Cortex-A9 processor, single core, | ||
| 110 | -out-of-order CPU with an issue width of 8, L1 cache size of 64~KB with access | ||
| 111 | -latency of 2 core cycles and a unified L2 cache of size 2~MB with hit latency of | ||
| 112 | -12 core cycles. The CPU and caches operate under the same clock domain. For our | ||
| 113 | -purposes, we have configured the CPU clock domain frequency to have a range of | ||
| 114 | -100--1000~MHZ with highest voltage being 1.25V. | 109 | +Our simulation infrastructure is based on Android~4.1.1 ``Jelly Bean'' run on |
| 110 | +the Gem5 full system simulator. We use default core configuration provided by | ||
| 111 | +Gem5 in revision 10585, that is designed to reflect ARM Cortex-A15 processor | ||
| 112 | +with L1 cache size of 64~KB with access latency of 2 core cycles and a unified | ||
| 113 | +L2 cache of size 2~MB with hit latency of 12 core cycles. The CPU and caches | ||
| 114 | +operate under the same clock domain. For our purposes, we have configured the | ||
| 115 | +CPU clock domain frequency to have a range of 100--1000~MHZ with highest voltage | ||
| 116 | +being 1.25V. | ||
| 115 | % MH: This might confuse readers | 117 | % MH: This might confuse readers |
| 116 | -%Our | ||
| 117 | -%experiments with a simple ring oscillator show that voltage changes by | ||
| 118 | -%0.02V/30MHz. The voltage and frequency pairs match with the frequency steps used | ||
| 119 | -%by the Nexus S. | 118 | +%Our experiments with a simple ring oscillator show that voltage changes by |
| 119 | +%0.02V/30MHz. The voltage and frequency pairs match with the frequency steps | ||
| 120 | +%used by the Nexus S. | ||
| 120 | 121 | ||
| 121 | For the memory system, we simulated a LPDDR3 single channel, one rank memory access using an open-page | 122 | For the memory system, we simulated a LPDDR3 single channel, one rank memory access using an open-page |
| 122 | policy. Timing and current parameters for LPDDR3 are configured as specified in | 123 | policy. Timing and current parameters for LPDDR3 are configured as specified in |
| 123 | -Micron data sheet~\cite{micronspec-url}. Memory clock domain is configured with a | 124 | +data sheets from Micron~\cite{micronspec-url}. Memory clock domain is configured with a |
| 124 | frequency range of 200MHz to 800MHz. As mentioned earlier, we did not scale memory | 125 | frequency range of 200MHz to 800MHz. As mentioned earlier, we did not scale memory |
| 125 | voltage. The power supplies---VDD and VDD2---for LPDDR3 are fixed at 1.8V and 1.2V respectively. | 126 | voltage. The power supplies---VDD and VDD2---for LPDDR3 are fixed at 1.8V and 1.2V respectively. |
| 126 | 127 | ||
| @@ -137,10 +138,10 @@ benchmarks that have interesting and unique phases. | @@ -137,10 +138,10 @@ benchmarks that have interesting and unique phases. | ||
| 137 | %selected benchmarks that have interesting and unique phases with finer | 138 | %selected benchmarks that have interesting and unique phases with finer |
| 138 | %frequency step granularity of 30MHz for CPU and 40MHz for memory, a total of | 139 | %frequency step granularity of 30MHz for CPU and 40MHz for memory, a total of |
| 139 | %496 settings. | 140 | %496 settings. |
| 140 | -Due to limited resources and time, running simulations for all benchmarks with | ||
| 141 | -finer frequency steps was difficult as it would have resulted in more than | ||
| 142 | -10,000 simulations, where each simulation would take anywhere between 4 to 12 | ||
| 143 | -hours. | 141 | +%Due to limited resources and time, running simulations for all benchmarks with |
| 142 | +%finer frequency steps was difficult as it would have resulted in more than | ||
| 143 | +%10,000 simulations, where each simulation would take anywhere between 4 to 12 | ||
| 144 | +%hours. | ||
| 144 | 145 | ||
| 145 | We collected samples of a fixed amount of work so that each sample would | 146 | We collected samples of a fixed amount of work so that each sample would |
| 146 | represent the same work even across different frequencies. In Gem5, we collected | 147 | represent the same work even across different frequencies. In Gem5, we collected |
| @@ -160,10 +161,10 @@ performance or energy. | @@ -160,10 +161,10 @@ performance or energy. | ||
| 160 | Although individual energy-performance trade-offs of DVFS for CPU and | 161 | Although individual energy-performance trade-offs of DVFS for CPU and |
| 161 | DFS for memory have been studied in the past, the trade-off resulting from | 162 | DFS for memory have been studied in the past, the trade-off resulting from |
| 162 | the cross-component interaction of these two components has not been | 163 | the cross-component interaction of these two components has not been |
| 163 | -characterized. CoScale~\cite{deng2012coscale} did point out that | 164 | +characterized. CoScale~\cite{deng2012coscale} did point out that |
| 164 | interplay of performance and energy consumption of these two | 165 | interplay of performance and energy consumption of these two |
| 165 | components is complex and did present a heuristic that attempts to | 166 | components is complex and did present a heuristic that attempts to |
| 166 | -pick the optimal point. However, it did not measure and characterize | 167 | +pick the optimal point. In the next Section, we measure and characterize |
| 167 | the larger space of all system level performance and energy trade-offs | 168 | the larger space of all system level performance and energy trade-offs |
| 168 | of various CPU and memory frequency settings. | 169 | of various CPU and memory frequency settings. |
| 169 | %In the next section, we study how performance and | 170 | %In the next section, we study how performance and |
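Note on system_methodology.tex: the CPU power model in the hunk near line 76 scales the measured peak dynamic power quadratically with voltage and linearly with frequency, $P \propto V^{2}f$. A minimal Python sketch of that scaling is below. The defaults of 1.25 V and 1000 MHz come from the configuration described in the text; the function name, the 1.0 W peak value, and the lower operating point are hypothetical.

```python
def scaled_dynamic_power(peak_power_w: float, voltage_v: float,
                         freq_mhz: float, peak_voltage_v: float = 1.25,
                         peak_freq_mhz: float = 1000.0) -> float:
    """Scale peak dynamic power to another operating point using P ~ V^2 * f."""
    voltage_ratio = voltage_v / peak_voltage_v
    freq_ratio = freq_mhz / peak_freq_mhz
    return peak_power_w * voltage_ratio ** 2 * freq_ratio


# Hypothetical peak of 1.0 W at 1.25 V and 1000 MHz, scaled to a lower
# (also hypothetical) operating point of 1.0 V and 500 MHz.
print(scaled_dynamic_power(1.0, voltage_v=1.0, freq_mhz=500.0))
```

This covers dynamic power only; the leakage model mentioned in the surrounding text is not part of the sketch.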