Commit 2d01fdbaebd55e81ba9255884cba0ad668ad2311

Authored by Rizwana Begum
1 parent 38e0d558

draft: camera ready submission

inefficiency.tex
@@ -151,14 +151,36 @@ We propose two methods for computing $E_{min}$: @@ -151,14 +151,36 @@ We propose two methods for computing $E_{min}$:
151 151
152 \end{itemize} 152 \end{itemize}
153 153
154 -We are working towards designing efficient energy prediction models for CPU,  
155 -memory and network components.  
156 -%  
157 -Our models consider cross-component interactions on performance and energy  
158 -consumption.  
159 -%  
160 -In this work we demonstrate how to use inefficiency, deferring predicting and  
161 -optimizing $E_{min}$ to future work. 154 +%We are working towards designing efficient energy prediction models for CPU,
  155 +%memory and network components.
  156 +%
  157 +%Our models consider cross-component interactions on performance and energy
  158 +%consumption.
  159 +%
  160 +%%%%%%%% MODEL %%%%%%%%%%
  161 +We designed efficient models to predict performance and energy consumption of
  162 +CPU and memory at various voltage and frequency settings for a given
  163 +application. We plan on using these models to estimate $E_{min}$ of a given set
  164 +of instructions.
  165 +%We envision a system capable of scaling voltage and frequency of CPU and only
  166 +%frequency of DRAM.
  167 +Our models consider cross-component interactions on performance and energy.
  168 +The performance model uses hardware performance counters to measure the amount of time
  169 +each component is $Busy$ completing the work, $Idle$ stalled on the other
  170 +component and $Waiting$ for more work. We designed a systematic methodology to
  171 +scale these states to estimate execution time of a given workload at different
  172 +voltage and frequency settings. In our model, the $Idle$ time of one component
  173 +depends on the settings of the second component. The $Busy$ time of each
  174 +component scales with its own frequency. However, the part of the $Busy$ time that
  175 +overlaps with the other component is constrained by the slower component.
  176 +
  177 +We combine predicted performance with the power models of CPU and memory
  178 +described in Section~\ref{subsec-energy-models} to estimate energy consumption
  179 +of CPU and memory. Our model has an average prediction error of 4\% across SPEC
  180 +CPU2006 benchmarks, with a highest error of 10\% except for $gobmk$ (18\%) and
  181 +$lbm$ (24\%). In this work we demonstrate how to use inefficiency, deferring
  182 +optimization of $E_{min}$ prediction to future work.
  183 +%%%%% END OF MODEL %%%%%%
162 184
163 \subsection{Managing Inefficiency} 185 \subsection{Managing Inefficiency}
164 % 186 %
@@ -183,4 +205,4 @@ We leave building some of these algorithms into a system as future work. @@ -183,4 +205,4 @@ We leave building some of these algorithms into a system as future work.
183 % 205 %
184 In this paper, we characterize the optimal performance point under different 206 In this paper, we characterize the optimal performance point under different
185 inefficiency constraints and illustrate that the stability of these points 207 inefficiency constraints and illustrate that the stability of these points
186 -have implications for future algorithms. 208 +has implications for future algorithms.
introduction.tex
@@ -32,7 +32,7 @@ energy constraints. @@ -32,7 +32,7 @@ energy constraints.
32 Our work represents two advances over previous efforts. 32 Our work represents two advances over previous efforts.
33 % 33 %
34 First, while previous works have explored energy minimizations using DVFS 34 First, while previous works have explored energy minimizations using DVFS
35 -under performance constraints focusing on reducing slack, we are the first to 35 +under performance constraints focusing on reducing slack~\cite{deng2012coscale}, we are the first to
36 study the potential DVFS settings under an energy constraint. 36 study the potential DVFS settings under an energy constraint.
37 % 37 %
38 Specifying performance constraints for servers is appropriate, since they are 38 Specifying performance constraints for servers is appropriate, since they are
@@ -86,7 +86,7 @@ management algorithms. @@ -86,7 +86,7 @@ management algorithms.
86 % 86 %
87 \end{enumerate} 87 \end{enumerate}
88 88
89 -We use the \texttt{gem5} simulator, the Android smartphone platform and Linux 89 +We use the \texttt{Gem5} simulator, the Android smartphone platform and Linux
90 kernel, and an empirical power model to (1) measure the inefficiency of 90 kernel, and an empirical power model to (1) measure the inefficiency of
91 several applications for a wide range of frequency settings, (2) compute 91 several applications for a wide range of frequency settings, (2) compute
92 performance clusters, and (3) study how they evolve. 92 performance clusters, and (3) study how they evolve.
optimal_performance.tex
@@ -61,7 +61,7 @@ simulation noise, the algorithm selects the settings with highest CPU (first) @@ -61,7 +61,7 @@ simulation noise, the algorithm selects the settings with highest CPU (first)
61 and memory frequency as this setting is bound to have highest performance among 61 and memory frequency as this setting is bound to have highest performance among
62 the other possibilities. 62 the other possibilities.
63 63
64 -Figure~\ref{gobmk-optimal} plots the optimal settings for Gobmk for all 64 +Figure~\ref{gobmk-optimal} plots the optimal settings for $gobmk$ for all
65 benchmark samples (each of length 10 million instructions) across multiple 65 benchmark samples (each of length 10 million instructions) across multiple
66 inefficiency constraints. At low inefficiencies, the optimal settings follow 66 inefficiency constraints. At low inefficiencies, the optimal settings follow
67 the trends in CPI (cycles per instruction) and MPKI (misses per thousand 67 the trends in CPI (cycles per instruction) and MPKI (misses per thousand
paper.tex
@@ -46,7 +46,17 @@ Geoffrey Challen, Mark Hempstead} @@ -46,7 +46,17 @@ Geoffrey Challen, Mark Hempstead}
46 } 46 }
47 47
48 \else 48 \else
49 -\author{\IEEEauthorblockN{Paper \thepapernumber}\vspace*{-0.1in}} 49 +%\author{\IEEEauthorblockN{Paper \thepapernumber}\vspace*{-0.1in}}
  50 +
  51 +\author{%
  52 + \IEEEauthorblockN{Rizwana Begum, David Werner and Mark Hempstead}
  53 + \IEEEauthorblockA{Drexel University\\
  54 + {\rm \tt{\{rb639,daw77,mhempstead\}@drexel.edu}}}
  55 + \and
  56 + \IEEEauthorblockN{Guru Prasad and Geoffrey Challen}
  57 + \IEEEauthorblockA{University at Buffalo\\
  58 + {\rm \tt \{gurupras,challen\}@buffalo.edu}}
  59 +}
50 60
51 \hypersetup{ 61 \hypersetup{
52 pdfinfo={ 62 pdfinfo={
performance_clusters.tex
@@ -178,7 +178,7 @@ constant. @@ -178,7 +178,7 @@ constant.
178 %Note that 178 %Note that
179 Our algorithm is not practical for real systems because it knows the characteristics of the 179 Our algorithm is not practical for real systems because it knows the characteristics of the
180 future samples and their performance clusters in the beginning of a stable 180 future samples and their performance clusters in the beginning of a stable
181 -region.% (and therefore is impractical to implement in real systems). 181 +region. % (and therefore is impractical to implement in real systems).
182 We are 182 We are
183 currently designing algorithms that are capable of tuning the system while 183 currently designing algorithms that are capable of tuning the system while
184 running the application as future work. In Section~\ref{sec-algo-implications}, we 184 running the application as future work. In Section~\ref{sec-algo-implications}, we
system_methodology.tex
@@ -71,6 +71,7 @@ and degrade performance simultaneously.} @@ -71,6 +71,7 @@ and degrade performance simultaneously.}
71 71
72 72
73 \subsection{Energy Models} 73 \subsection{Energy Models}
  74 +\label{subsec-energy-models}
74 We developed energy models for the CPU and DRAM for our studies. Gem5 comes 75 We developed energy models for the CPU and DRAM for our studies. Gem5 comes
75 with the energy models for various DRAM chipsets. The 76 with the energy models for various DRAM chipsets. The
76 DRAMPower~\cite{drampower-tool} model is integrated into Gem5 and computes the 77 DRAMPower~\cite{drampower-tool} model is integrated into Gem5 and computes the
@@ -79,7 +80,7 @@ Gem5 lacks a model for CPU energy consumption. We developed a processor power @@ -79,7 +80,7 @@ Gem5 lacks a model for CPU energy consumption. We developed a processor power
79 model based on empirical measurements of a PandaBoard~\cite{pandaboard-url} 80 model based on empirical measurements of a PandaBoard~\cite{pandaboard-url}
80 evaluation board. The board includes an OMAP4430~chipset with a Cortex~A9 81 evaluation board. The board includes an OMAP4430~chipset with a Cortex~A9
81 processor; this chipset is used in the mobile platform we want to emulate, the 82 processor; this chipset is used in the mobile platform we want to emulate, the
82 -Samsung Nexus S. We ran microbenchmarks designed to stress the Pandaboard to 83 +Samsung Nexus S. We ran microbenchmarks designed to stress the PandaBoard to
83 its full utilization and measured power consumed using an Agilent~34411A 84 its full utilization and measured power consumed using an Agilent~34411A
84 multimeter. Because of the limitations of the platform, we could only measure 85 multimeter. Because of the limitations of the platform, we could only measure
85 peak dynamic power. Therefore to model different voltage levels we scaled it 86 peak dynamic power. Therefore to model different voltage levels we scaled it
@@ -119,7 +120,7 @@ purposes, we have configured the CPU clock domain frequency to have a range of @@ -119,7 +120,7 @@ purposes, we have configured the CPU clock domain frequency to have a range of
119 120
120 For the memory system, we simulated an LPDDR3 single channel, one rank memory access using an open-page 121 For the memory system, we simulated an LPDDR3 single channel, one rank memory access using an open-page
121 policy. Timing and current parameters for LPDDR3 are configured as specified in 122 policy. Timing and current parameters for LPDDR3 are configured as specified in
122 -micron data sheet~\cite{micronspec-url}. Memory clock domain is configured with a 123 +Micron data sheet~\cite{micronspec-url}. Memory clock domain is configured with a
123 frequency range of 200MHz to 800MHz. As mentioned earlier, we did not scale memory 124 frequency range of 200MHz to 800MHz. As mentioned earlier, we did not scale memory
124 voltage. The power supplies---VDD and VDD2---for LPDDR3 are fixed at 1.8V and 1.2V respectively. 125 voltage. The power supplies---VDD and VDD2---for LPDDR3 are fixed at 1.8V and 1.2V respectively.
125 126
@@ -142,7 +143,7 @@ finer frequency steps was difficult as it would have resulted in more than @@ -142,7 +143,7 @@ finer frequency steps was difficult as it would have resulted in more than
142 hours. 143 hours.
143 144
144 We collected samples of a fixed amount of work so that each sample would 145 We collected samples of a fixed amount of work so that each sample would
145 -represent the same work even across different frequencies. In gem5, we collectd 146 +represent the same work even across different frequencies. In Gem5, we collected
146 performance and energy consumption data every 10~million user mode 147 performance and energy consumption data every 10~million user mode
147 instructions. 148 instructions.
148 %this fixed sample of work makes . 149 %this fixed sample of work makes .