Commit a160460b65da45fd7effab7c6cd7ed2e1c7085b1

Authored by Rizwana Begum
1 parent 3b0c7aa8

Incorporated Mark's comments

abstract.tex
 \begin{abstract}

 Battery lifetime continues to be a top complaint about smartphones. Dynamic
-voltage and frequency scaling (DVFS) has existed for mobile device CPUs for
-some time, and can be used to dynamically trade off energy for performance.
-To make more energy-performance tradeoffs possible, DVFS is beginning to be
-applied to memory as well.
+voltage and frequency scaling (DVFS) has existed for mobile device CPUs for some
+time, and provides a tradeoff between energy and performance. DVFS is beginning
+to be applied to memory as well to make more energy-performance tradeoffs
+possible.

 We present the first characterization of the behavior and optimal frequency
 settings of workloads running both under \textit{energy constraints} and on
acknowledgement.tex 0 → 100644
+\section{Acknowledgement}
+This material is based on work partially supported by NSF Collaborative Awards
+CSR-1409014 and CSR-1409367. Any opinion, findings, and conclusions or
+recommendations expressed in this material are those of the authors and do not
+necessarily reflect the views of the National Science Foundation.
inefficiency.tex
@@ -7,24 +7,24 @@ management algorithms for mobile systems should optimize performance under
 \textit{energy constraints}.
 %
 While several researchers have proposed algorithms that work under energy
-constraints~\cite{mobiheld09-cinder,ecosystem}, these approaches require that
-the constraints are expressed in terms of absolute energy.
+constraints, these approaches require that the constraints are expressed in
+terms of absolute energy~\cite{mobiheld09-cinder,ecosystem}.
 %
-For example, rate-limiting approaches~\cite{mobiheld09-cinder} take the
-maximum energy that can be consumed in a given time period as an input.
+For example, rate-limiting approaches take the maximum energy that can be
+consumed in a given time period as an input~\cite{mobiheld09-cinder}.
 %
 Once the application consumes its limit, it is paused until the next time
 period begins.

-Unfortunately, in practice it is difficult to choose absolute energy
+Unfortunately, in practice, it is difficult to choose absolute energy
 constraints appropriately for a diverse group of applications without
 understanding their inherent energy needs.
 %
 Energy consumption varies across applications, devices, and operating
 conditions, making it impractical to choose an absolute energy budget.
 %
-Also, absolute energy constraints may slow down applications to the point
-that total energy consumption \textit{increases} at the same time that
+Also, applying absolute energy constraints may slow down applications to the
+point that total energy consumption \textit{increases} and
 performance is degraded.

 Other metrics that incorporate energy take the form of $Energy * Delay^n$.
@@ -56,7 +56,7 @@ energy the application could have consumed ($E_{min}$) on the same device as
 inefficiency: $I = \frac{E}{E_{min}}$.
 %
 An \textit{inefficiency} of $1$ represents an application's most efficient
-execution, while $1.5$ indicate the the application consumed $50\%$ more
+execution, while $1.5$ indicates that the application consumed $50\%$ more
 energy that its most efficient execution.
 %
 Inefficiency is independent of workloads and devices and avoids the problems
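
The hunk above is built around the inefficiency metric $I = \frac{E}{E_{min}}$. As a quick illustration, a minimal sketch with hypothetical energy values; only the formula itself comes from the paper text:

    # Inefficiency I = E / E_min: energy actually consumed divided by the
    # minimum energy the application could have consumed on the same device.
    def inefficiency(energy_joules, min_energy_joules):
        return energy_joules / min_energy_joules

    # Hypothetical numbers: an execution that used 4.5 J when 3.0 J was the most
    # efficient possible execution has inefficiency 1.5, i.e. 50% extra energy.
    print(inefficiency(4.5, 3.0))  # -> 1.5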
@@ -86,7 +86,7 @@ We continue by addressing these questions.
 % performance.
 Devices will operate between an inefficiency of 1 and $I_{max}$ which
 represents the unbounded energy constraint allowing the application to
-consume unbounded energy to deliver the best performance.
+consume as much energy as necessary to deliver the best performance.
 %
 $I_{max}$ depends upon applications and devices.
 %
@@ -165,7 +165,7 @@ of instructions.
 %We envision a system capable of scaling voltage and frequency of CPU and only
 %frequency of DRAM.
 Our models consider cross-component interactions on performance and energy.
-Performance model uses hardware performance counters to measure amount of time
+The performance model uses hardware performance counters to measure amount of time
 each component is $Busy$ completing the work, $Idle$ stalled on the other
 component and $Waiting$ for more work. We designed systematic methodology to
 scale these states to estimate execution time of a given workload at different
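
The text above describes measuring per-component $Busy$, $Idle$, and $Waiting$ time and scaling these states across frequencies. A rough first-order sketch of the idea; the paper's actual methodology also models cross-component interaction, and the simple scaling rule and names below are assumptions for illustration only:

    # First-order sketch: a component's Busy time scales inversely with its own
    # frequency, while Idle (stalled on the other component) and Waiting (no
    # work) are left unchanged here. This ignores the cross-component effects
    # that the paper's full model accounts for.
    def estimate_component_time(busy_s, idle_s, waiting_s, f_old_mhz, f_new_mhz):
        return busy_s * (f_old_mhz / f_new_mhz) + idle_s + waiting_s

    # e.g. 0.6 s of busy time measured at 1000 MHz, re-estimated at 500 MHz
    print(estimate_component_time(0.6, 0.1, 0.05, 1000, 500))  # -> 1.35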
@@ -198,8 +198,8 @@ system~\cite{david2011memory,deng2012multiscale,deng2011memscale,diniz2007limiti
 %
 While most of the existing multi-component energy management approaches work
 under performance constraints, some have potential to be modified to work
-under energy constraints and thus
-inefficiency~\cite{bitirgen2008coordinated,deng2012coscale,chen2011coordinating,fan2005synergy,felter2005performance,li2007cross,raghavendra2008no}.
+under energy constraints and thus could operate under
+inefficiency budget~\cite{bitirgen2008coordinated,deng2012coscale,chen2011coordinating,fan2005synergy,felter2005performance,li2007cross,raghavendra2008no}.
 %
 We leave building some of these algorithms into a system as future work.
 %
inefficiency_speedup.tex
@@ -5,7 +5,7 @@ Scaling individual components---CPU and memory---using DVFS has been studied in
 the past
 %and
 %researchers have used it
-to make power performance trade-offs. To the best of our knowledge, the prior
+to make power performance trade-offs. To the best of our knowledge, prior
 work has not studied the system level energy-performance trade-offs of combined
 CPU and memory DVFS.
 %considering the interaction between CPU and memory
@@ -43,8 +43,8 @@ We make three major observations:
 \noindent \textit{Running slower doesn't mean that system is running
 efficiently.} At the lowest frequencies, 100MHz and 200MHz for CPU and
 memory respectively, \textit{gobmk} takes the longest to execute. These settings slow down the application so much
-that its overall energy consumption increases, thereby resulting in 1.55
-inefficiency for \textit{gobmk}. Algorithms that choose these frequency settings spend
+that its overall energy consumption increases, thereby resulting in
+inefficiency of 1.55 for \textit{gobmk}. Algorithms that choose these frequency settings spend
 55\% more energy without any performance improvement.
 %The converse is also true
 %as noted by our second observation.
@@ -70,9 +70,9 @@ a) use no more than given inefficiency budget b) should use only as much
 inefficiency budget as needed c) and deliver the best performance.
 %\end{enumerate}

-Consequently, like other constraints used by algorithms such as performance, power and absolute energy, $inefficiency$
+Consequently, like other constraints used by algorithms such as performance, power and absolute energy, inefficiency
 also allows energy management algorithms to waste system energy. We suggest
-that, even though $inefficiency$ doesn't completely eliminate the problem of
+that, even though inefficiency doesn't completely eliminate the problem of
 wasting energy, it mitigates the problem. For example, rate limiting approaches
 waste energy as energy budget is specified for a given amount of time interval
 and doesn't require a specific amount of work to be done within that budget.
optimal_performance.tex
@@ -58,7 +58,7 @@ possible frequency settings under given inefficiency budget. It then finds the
 CPU and memory frequency settings that result in highest speedup. In cases
 where multiple settings result in similar speedup (within 0.5\%), to filter out
 simulation noise, the algorithm selects the settings with highest CPU (first)
-and memory frequency as this setting is bound to have highest performance among
+and then memory frequency as this setting is bound to have highest performance among
 the other possibilities.

 Figure~\ref{gobmk-optimal} plots the optimal settings for $gobmk$ for all
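
The text above describes the selection step: only settings within the inefficiency budget are considered, the highest speedup wins, and speedups within 0.5% of the best are treated as simulation noise and broken by highest CPU frequency first, then memory frequency. A small sketch of that rule; the data layout and names are ours, not the paper's:

    def pick_optimal_setting(settings, budget):
        # settings: list of dicts, e.g.
        #   {"cpu_mhz": 800, "mem_mhz": 400, "speedup": 0.93, "inefficiency": 1.12}
        # Assumes at least one setting satisfies the budget.
        feasible = [s for s in settings if s["inefficiency"] <= budget]
        best = max(s["speedup"] for s in feasible)
        ties = [s for s in feasible if s["speedup"] >= best * (1 - 0.005)]
        # Break ties by highest CPU frequency first, then highest memory frequency.
        return max(ties, key=lambda s: (s["cpu_mhz"], s["mem_mhz"]))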
@@ -103,7 +103,7 @@ optimal settings for every sample may hinder some energy-performance trade-off
 that could have been made if performance was not so tightly bounded (to only
 highest performance). For example, \textit{bzip2} is CPU bound and therefore
 its performance at memory frequency of 200MHz is within 3\% of performance at a
-memory frequency of 800MHz while CPU is running at 1000MHz. By sacrificing that
+memory frequency of 800MHz while the CPU is running at 1000MHz. By sacrificing that
 3\% of performance, the system could have consumed 1/4 the memory background
 energy staying well under the given inefficiency budget.
 %\end{enumerate}
paper.tex
@@ -81,6 +81,7 @@ Geoffrey Challen, Mark Hempstead}
 \input{algorithm_implications.tex}
 % 20 Apr 2015 : GWA : Add things here as needed.
 \input{conclusions.tex}
+\input{acknowledgement.tex}

 % 23 Sep 2014 : GWA : TODO : Reenable before submission.

performance_clusters.tex
@@ -122,8 +122,8 @@ Not all of the stable regions increase in length with increasing inefficiency bu
 %inefficiency is a
 %function of workload characteristics.
 If consecutive
-samples of a workload have a small difference in performance but differ significantly in energy
-consumption then only at
+samples of a workload have a small difference in performance, but differ significantly in energy
+consumption, then only at
 higher inefficiency budgets will the system find common settings for these
 consecutive samples. % because all settings under an inefficiency budget are considered.
 %Note that we find the performance clusters by considering
@@ -144,7 +144,7 @@ Figure~\ref{clusters-milc} shows that \textit{milc} has similar trends as

 An interesting observation from the performance clusters is that algorithms
 like CoScale~\cite{deng2012coscale} that search for the best performing settings every interval starting
-from the maximum frequency settings are not optimal. Algorithms can reduce the
+from the maximum frequency settings are not efficient. Algorithms can reduce the
 overhead of optimal settings search by starting search from the settings selected
 for the previous interval as application phases are often stable for multiple sample intervals.
 %as the application phases don't change drastically in
@@ -168,7 +168,7 @@ settings between the current sample performance cluster and the available
 settings until the previous sample. When the algorithm finds no more common
 samples, it marks the end of the stable region. If more than one frequency pair
 exists in the available settings for this region, the algorithm chooses the
-setting with highest CPU (first) and then memory frequency as optimal settings for this
+setting with highest CPU (first) and then memory frequency as optimal settings for this
 region. Figure~\ref{lbm-stable-line-5-annotated} shows the CPU and memory frequency
 settings selected for stable regions of benchmark \textit{lbm}. It also has
 markers indicating the end of each stable region. In this figure, note that for
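
For context, a rough sketch of the stable-region construction the text above describes, based on our reading; names and data layout are assumptions. Each sample contributes the set of (CPU MHz, memory MHz) pairs allowed under the budget, a region grows while the running intersection stays non-empty, and the region's setting is the pair with highest CPU, then memory, frequency.

    def stable_regions(per_sample_settings):
        # per_sample_settings: non-empty list of non-empty sets of
        # (cpu_mhz, mem_mhz) tuples allowed for each sample.
        regions, available, start = [], None, 0
        for i, settings in enumerate(per_sample_settings):
            common = settings if available is None else available & settings
            if common:
                available = common
            else:
                # No common settings left: close the region with the highest
                # CPU-then-memory pair (tuple comparison gives that order).
                regions.append((start, i - 1, max(available)))
                available, start = settings, i
        regions.append((start, len(per_sample_settings) - 1, max(available)))
        return regions  # list of (first_sample, last_sample, (cpu_mhz, mem_mhz))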
@@ -176,11 +176,11 @@ every stable region (between any two markers) the frequency of both CPU and memo
 constant.

 %Note that
-Our algorithm is not practical for real systems, it knows the characteristics of the
+Our algorithm is not practical for real systems, as it knows the characteristics of the
 future samples and their performance clusters in the beginning of a stable
 region. % (and therefore is impractical to implement in real systems).
 We are
-currently designing algorithms that are capable of tuning the system while
+currently designing algorithms in hardware and software that are capable of tuning the system while
 running the application as future work. In Section~\ref{sec-algo-implications}, we
 propose ways in which length of stable regions and the available settings for a
 given region can be predicted for energy management algorithms in real systems.
@@ -260,7 +260,7 @@ across benchmarks for multiple cluster thresholds at inefficiency budget of 1.3.

 \subsection{Energy-Performance Trade-offs}
 In this subsection we analyze the energy-performance trade-offs made by our
-ideal algorithm. We then add tuning cost of our algorithm and compare the
+ideal algorithm. We then add the tuning cost of our algorithm and compare the
 energy performance trade-offs across multiple applications. We study multiple
 cluster thresholds and an inefficiency budget of 1.3.

@@ -300,9 +300,7 @@ frequency transitions. We assume tuning overhead
 of 500us and 30uJ, which includes computing inefficiencies, searching for the
 optimal setting and transition the hardware to new
 settings~\cite{deng2012coscale}. We assumed that a space of 100 settings is
-searched for every transition. \textit{gobmk} is the only benchmark that shows a
-performance improvement from the optimal settings when performance is allowed to
-degrade, which is unexpected. We are investigating its root cause.
+searched for every transition.
 %This is not intuitive and we are investigating the cause of this anomaly
 %\XXXnote{MH: be careful I would cut this s%entance at a minimum and then find
 %the reason for the change}.
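
The text above assumes a fixed per-transition tuning overhead of 500us and 30uJ. A trivial sketch of how such a fixed cost can be folded into the estimated totals; the aggregation itself is ours, for illustration only:

    def add_tuning_overhead(exec_time_s, energy_j, num_transitions,
                            t_per_transition_s=500e-6, e_per_transition_j=30e-6):
        # Each frequency transition adds a fixed time and energy cost.
        return (exec_time_s + num_transitions * t_per_transition_s,
                energy_j + num_transitions * e_per_transition_j)

    # e.g. 120 transitions add 60 ms and 3.6 mJ to a hypothetical 10 s / 5 J run
    print(add_tuning_overhead(10.0, 5.0, 120))  # -> (10.06, 5.0036)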
@@ -317,8 +315,8 @@ samples. This results in longer stable regions.
 stable region. The longer the stable regions, the lower
 the number of transitions that the system need to make.
 \item Allowing a higher degradation in performance may, in fact, result in improved
-performance when tuning overhead of algorithms is included due to reduction in
-number of frequency transitions in the system. Consequently energy savings also
+performance when tuning overhead is included due to reduction in
+number of frequency transitions in the system, consequently energy savings also
 increase.
 \end{enumerate}

system_methodology.tex
@@ -12,7 +12,8 @@ Recent
 research~\cite{david2011memory,deng2011memscale} has shown that DRAM frequency scaling
 also provides performance and energy trade-offs.

-In this work, we scale frequency and voltage for the CPU and scale only frequency for memory.
+In this work, we scale frequency and voltage for the CPU and scale only
+frequency for the memory~\cite{david2011memory,deng2011memscale}.
 %In this work, we scale frequency and voltage for the CPU and for the memory, scale frequency only.
 %to make energy-performance trade-offs.
 %Dynamic Voltage and
@@ -42,7 +43,7 @@ to perform our studies.
 Current Gem5 versions provide the infrastructure necessary to change CPU
 frequency and voltage; we extended Gem5 DVFS to incorporate memory frequency
 scaling. As shown in Figure~\ref{fig-system-block-diag}, Gem5 provides a DVFS
-controller device that provides interface to control frequency by the OS at
+controller device that provides an interface to control frequency by the OS at
 runtime. We developed a memory frequency governor similar to existing Linux CPU
 frequency governors.
 %that are capable of tuning memory frequency at runtime.
@@ -76,14 +77,14 @@ We developed energy models for the CPU and DRAM for our studies. Gem5 comes
 with the energy models for various DRAM chipsets. The
 DRAMPower~\cite{drampower-tool} model is integrated into Gem5 and computes the
 memory energy consumption periodically during the benchmark execution. However,
-Gem5 lacks a model for CPU energy consumption. We developed a processor power
+Gem5 lacks a model for CPU energy consumption. We developed a processor power
 model based on empirical measurements of a PandaBoard~\cite{pandaboard-url}
 evaluation board. The board includes a OMAP4430~chipset with a Cortex~A9
 processor; this chipset is used in the mobile platform we want to emulate, the
 Samsung Nexus S. We ran microbenchmarks designed to stress the PandaBoard to
 its full utilization and measured power consumed using an Agilent~34411A
 multimeter. Because of the limitations of the platform, we could only measure
-peak dynamic power. Therefore to model different voltage levels we scaled it
+peak dynamic power. Therefore, to model different voltage levels we scaled it
 quadratically with voltage and linear with frequency $(P{\propto}V^{2}f)$. Our
 peak dynamic power agrees with the numbers reported by previous
 work~\cite{poweragile-hotos11} and the datasheets.
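
The text above scales the measured peak dynamic power as $P \propto V^{2}f$. A minimal sketch of that scaling rule; the reference numbers below are placeholders, not the measured PandaBoard values:

    def scale_dynamic_power(p_peak_w, v_peak, f_peak_mhz, v, f_mhz):
        # Dynamic power scales quadratically with voltage, linearly with frequency.
        return p_peak_w * (v / v_peak) ** 2 * (f_mhz / f_peak_mhz)

    # e.g. a hypothetical 0.6 W peak at 1.25 V / 1000 MHz, re-estimated at 1.0 V / 500 MHz
    print(scale_dynamic_power(0.6, 1.25, 1000, 1.0, 500))  # ~0.19 W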
@@ -105,22 +106,22 @@ proportional to supply voltage~\cite{leakage-islped02}.
 %with our CPU power model to compute CPU energy consumption of the application at run time.

 \subsection{Experimental Methodology}
-Our simulation infrastructure is based on Android~4.1.1 ``Jelly Bean'' run
-on the Gem5 full system simulator. We model a Cortex-A9 processor, single core,
-out-of-order CPU with an issue width of 8, L1 cache size of 64~KB with access
-latency of 2 core cycles and a unified L2 cache of size 2~MB with hit latency of
-12 core cycles. The CPU and caches operate under the same clock domain. For our
-purposes, we have configured the CPU clock domain frequency to have a range of
-100--1000~MHZ with highest voltage being 1.25V.
+Our simulation infrastructure is based on Android~4.1.1 ``Jelly Bean'' run on
+the Gem5 full system simulator. We use default core configuration provided by
+Gem5 in revision 10585, that is designed to reflect ARM Cortex-A15 processor
+with L1 cache size of 64~KB with access latency of 2 core cycles and a unified
+L2 cache of size 2~MB with hit latency of 12 core cycles. The CPU and caches
+operate under the same clock domain. For our purposes, we have configured the
+CPU clock domain frequency to have a range of 100--1000~MHZ with highest voltage
+being 1.25V.
 % MH: This might confuse readers
-%Our
-%experiments with a simple ring oscillator show that voltage changes by
-%0.02V/30MHz. The voltage and frequency pairs match with the frequency steps used
-%by the Nexus S.
+%Our experiments with a simple ring oscillator show that voltage changes by
+%0.02V/30MHz. The voltage and frequency pairs match with the frequency steps
+%used by the Nexus S.

 For the memory system, we simulated a LPDDR3 single channel, one rank memory access using an open-page
 policy. Timing and current parameters for LPDDR3 are configured as specified in
-Micron data sheet~\cite{micronspec-url}. Memory clock domain is configured with a
+data sheets from Micron~\cite{micronspec-url}. Memory clock domain is configured with a
 frequency range of 200MHz to 800MHz. As mentioned earlier, we did not scale memory
 voltage. The power supplies---VDD and VDD2---for LPDDR3 are fixed at 1.8V and 1.2V respectively.

@@ -137,10 +138,10 @@ benchmarks that have interesting and unique phases.
 %selected benchmarks that have interesting and unique phases with finer
 %frequency step granularity of 30MHz for CPU and 40MHz for memory, a total of
 %496 settings.
-Due to limited resources and time, running simulations for all benchmarks with
-finer frequency steps was difficult as it would have resulted in more than
-10,000 simulations, where each simulation would take anywhere between 4 to 12
-hours.
+%Due to limited resources and time, running simulations for all benchmarks with
+%finer frequency steps was difficult as it would have resulted in more than
+%10,000 simulations, where each simulation would take anywhere between 4 to 12
+%hours.

 We collected samples of a fixed amount of work so that each sample would
 represent the same work even across different frequencies. In Gem5, we collected
@@ -160,10 +161,10 @@ performance or energy.
 Although individual energy-performance trade-offs of DVFS for CPU and
 DFS for memory have been studied in the past, the trade-off resulting from
 the cross-component interaction of these two components has not been
-characterized. CoScale~\cite{deng2012coscale} did point out that
+characterized. CoScale~\cite{deng2012coscale} did point out that
 interplay of performance and energy consumption of these two
 components is complex and did present a heuristic that attempts to
-pick the optimal point. However, it did not measure and characterize
+pick the optimal point. In the next Section, we measure and characterize
 the larger space of all system level performance and energy trade-offs
 of various CPU and memory frequency settings.
 %In the next section, we study how performance and