Commit 2d01fdbaebd55e81ba9255884cba0ad668ad2311

Authored by Rizwana Begum
1 parent 38e0d558

draft: camera ready submission

inefficiency.tex
... ... @@ -151,14 +151,36 @@ We propose two methods for computing $E_{min}$:
151 151  
152 152 \end{itemize}
153 153  
154   -We are working towards designing efficient energy prediction models for CPU,
155   -memory and network components.
156   -%
157   -Our models consider cross-component interactions on performance and energy
158   -consumption.
159   -%
160   -In this work we demonstrate how to use inefficiency, deferring predicting and
161   -optimizing $E_{min}$ to future work.
  154 +%We are working towards designing efficient energy prediction models for CPU,
  155 +%memory and network components.
  156 +%
  157 +%Our models consider cross-component interactions on performance and energy
  158 +%consumption.
  159 +%
  160 +%%%%%%%% MODEL %%%%%%%%%%
  161 +We designed efficient models to predict performance and energy consumption of
  162 +CPU and memory at various voltage and frequency settings for a given
  163 +application. We plan on using these models to estimate $E_{min}$ of a given set
  164 +of instructions.
  165 +%We envision a system capable of scaling voltage and frequency of CPU and only
  166 +%frequency of DRAM.
  167 +Our models consider cross-component interactions on performance and energy.
  168 +The performance model uses hardware performance counters to measure the time
  169 +each component spends $Busy$ completing work, $Idle$ stalled on the other
  170 +component, and $Waiting$ for more work. We designed a systematic methodology to
  171 +scale these states to estimate execution time of a given workload at different
  172 +voltage and frequency settings. In our model, the $Idle$ time of one component
  173 +depends on the settings of the other component. The $Busy$ time of each
  174 +component scales with its own frequency. However, the part of the $Busy$ time that
  175 +overlaps with the other component is constrained by the slower component.
  176 +
  177 +We combine predicted performance with the power models of CPU and memory
  178 +described in Section~\ref{subsec-energy-models} to estimate energy consumption
  179 +of CPU and memory. Our model has an average prediction error of 4\% across SPEC
  180 +CPU2006 benchmarks, with a maximum error of 10\% except for $gobmk$ (18\%) and
  181 +$lbm$ (24\%). In this work we demonstrate how to use inefficiency, deferring
  182 +optimization of $E_{min}$ prediction to future work.
  183 +%%%%% END OF MODEL %%%%%%
162 184  
163 185 \subsection{Managing Inefficiency}
164 186 %
... ... @@ -183,4 +205,4 @@ We leave building some of these algorithms into a system as future work.
183 205 %
184 206 In this paper, we characterize the optimal performance point under different
185 207 inefficiency constraints and illustrate that the stability of these points
186   -have implications for future algorithms.
  208 +has implications for future algorithms.
... ...
introduction.tex
... ... @@ -32,7 +32,7 @@ energy constraints.
32 32 Our work represents two advances over previous efforts.
33 33 %
34 34 First, while previous works have explored energy minimizations using DVFS
35   -under performance constraints focusing on reducing slack, we are the first to
  35 +under performance constraints focusing on reducing slack~\cite{deng2012coscale}, we are the first to
36 36 study the potential DVFS settings under an energy constraint.
37 37 %
38 38 Specifying performance constraints for servers is appropriate, since they are
... ... @@ -86,7 +86,7 @@ management algorithms.
86 86 %
87 87 \end{enumerate}
88 88  
89   -We use the \texttt{gem5} simulator, the Android smartphone platform and Linux
  89 +We use the \texttt{Gem5} simulator, the Android smartphone platform and Linux
90 90 kernel, and an empirical power model to (1) measure the inefficiency of
91 91 several applications for a wide range of frequency settings, (2) compute
92 92 performance clusters, and (3) study how they evolve.
... ...
optimal_performance.tex
... ... @@ -61,7 +61,7 @@ simulation noise, the algorithm selects the settings with highest CPU (first)
61 61 and memory frequency as this setting is bound to have highest performance among
62 62 the other possibilities.
63 63  
64   -Figure~\ref{gobmk-optimal} plots the optimal settings for Gobmk for all
  64 +Figure~\ref{gobmk-optimal} plots the optimal settings for $gobmk$ for all
65 65 benchmark samples (each of length 10 million instructions) across multiple
66 66 inefficiency constraints. At low inefficiencies, the optimal settings follow
67 67 the trends in CPI (cycles per instruction) and MPKI (misses per thousand
... ...
paper.tex
... ... @@ -46,7 +46,17 @@ Geoffrey Challen, Mark Hempstead}
46 46 }
47 47  
48 48 \else
49   -\author{\IEEEauthorblockN{Paper \thepapernumber}\vspace*{-0.1in}}
  49 +%\author{\IEEEauthorblockN{Paper \thepapernumber}\vspace*{-0.1in}}
  50 +
  51 +\author{%
  52 + \IEEEauthorblockN{Rizwana Begum, David Werner and Mark Hempstead}
  53 + \IEEEauthorblockA{Drexel University\\
  54 + {\rm \tt{\{rb639,daw77,mhempstead\}@drexel.edu}}}
  55 + \and
  56 + \IEEEauthorblockN{Guru Prasad and Geoffrey Challen}
  57 + \IEEEauthorblockA{University at Buffalo\\
  58 + {\rm \tt \{gurupras,challen\}@buffalo.edu}}
  59 +}
50 60  
51 61 \hypersetup{
52 62 pdfinfo={
... ...
performance_clusters.tex
... ... @@ -178,7 +178,7 @@ constant.
178 178 %Note that
179 179 Our algorithm is not practical for real systems because it knows the characteristics of the
180 180 future samples and their performance clusters at the beginning of a stable
181   -region.% (and therefore is impractical to implement in real systems).
  181 +region. % (and therefore is impractical to implement in real systems).
182 182 We are
183 183 currently designing algorithms that are capable of tuning the system while
184 184 running the application as future work. In Section~\ref{sec-algo-implications}, we
... ...
system_methodology.tex
... ... @@ -71,6 +71,7 @@ and degrade performance simultaneously.}
71 71  
72 72  
73 73 \subsection{Energy Models}
  74 +\label{subsec-energy-models}
74 75 We developed energy models for the CPU and DRAM for our studies. Gem5 comes
75 76 with the energy models for various DRAM chipsets. The
76 77 DRAMPower~\cite{drampower-tool} model is integrated into Gem5 and computes the
... ... @@ -79,7 +80,7 @@ Gem5 lacks a model for CPU energy consumption. We developed a processor power
79 80 model based on empirical measurements of a PandaBoard~\cite{pandaboard-url}
80 81 evaluation board. The board includes an OMAP4430~chipset with a Cortex~A9
81 82 processor; this chipset is used in the mobile platform we want to emulate, the
82   -Samsung Nexus S. We ran microbenchmarks designed to stress the Pandaboard to
  83 +Samsung Nexus S. We ran microbenchmarks designed to stress the PandaBoard to
83 84 its full utilization and measured power consumed using an Agilent~34411A
84 85 multimeter. Because of the limitations of the platform, we could only measure
85 86 peak dynamic power. Therefore to model different voltage levels we scaled it
... ... @@ -119,7 +120,7 @@ purposes, we have configured the CPU clock domain frequency to have a range of
119 120  
120 121 For the memory system, we simulated an LPDDR3 single-channel, one-rank memory access using an open-page
121 122 policy. Timing and current parameters for LPDDR3 are configured as specified in
122   -micron data sheet~\cite{micronspec-url}. Memory clock domain is configured with a
  123 +the Micron data sheet~\cite{micronspec-url}. The memory clock domain is configured with a
123 124 frequency range of 200MHz to 800MHz. As mentioned earlier, we did not scale memory
124 125 voltage. The power supplies---VDD and VDD2---for LPDDR3 are fixed at 1.8V and 1.2V respectively.
125 126  
... ... @@ -142,7 +143,7 @@ finer frequency steps was difficult as it would have resulted in more than
142 143 hours.
143 144  
144 145 We collected samples of a fixed amount of work so that each sample would
145   -represent the same work even across different frequencies. In gem5, we collectd
  146 +represent the same work even across different frequencies. In Gem5, we collected
146 147 performance and energy consumption data every 10~million user mode
147 148 instructions.
148 149 %this fixed sample of work makes .
... ...
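
Note on the performance model added in inefficiency.tex: the scaling rules in that hunk (each component's $Busy$ time scales with its own frequency, overlapped $Busy$ time is bounded by the slower component, and one component's $Idle$ time depends on the other's settings) can be illustrated with a small sketch. This is not the authors' implementation. The class `Sample`, the function `predict_time`, the three-way time split, and the assumption that time scales inversely with frequency are all hypothetical simplifications chosen for illustration; $Waiting$ time is left out.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Time breakdown of one sample measured at baseline settings.

    All field names are hypothetical; times are in seconds, frequencies in MHz.
    """
    cpu_only_busy: float  # CPU busy while memory is not busy
    mem_only_busy: float  # memory busy while CPU is stalled (the CPU's Idle time)
    overlap_busy: float   # both components busy at the same time
    f_cpu: float          # baseline CPU frequency
    f_mem: float          # baseline memory frequency


def predict_time(s: Sample, f_cpu: float, f_mem: float) -> float:
    """Estimate execution time at new frequencies (assumed 1/f scaling)."""
    cpu_scale = s.f_cpu / f_cpu  # > 1 means the CPU runs slower than baseline
    mem_scale = s.f_mem / f_mem  # > 1 means memory runs slower than baseline

    # Non-overlapped busy time scales with the component's own frequency.
    cpu_only = s.cpu_only_busy * cpu_scale
    # CPU Idle time is memory busy time, so it follows the memory setting.
    mem_only = s.mem_only_busy * mem_scale
    # Overlapped busy time is constrained by whichever component slows more.
    overlap = s.overlap_busy * max(cpu_scale, mem_scale)

    return cpu_only + mem_only + overlap


baseline = Sample(cpu_only_busy=4.0, mem_only_busy=2.0, overlap_busy=2.0,
                  f_cpu=1000.0, f_mem=800.0)
same = predict_time(baseline, 1000.0, 800.0)
slow_cpu = predict_time(baseline, 500.0, 800.0)
slow_mem = predict_time(baseline, 1000.0, 400.0)
```

In this sketch, halving only the CPU frequency stretches the CPU-only and overlapped portions but leaves the CPU's stall time unchanged, which is the cross-component interaction the hunk describes. Multiplying the predicted time by per-component power estimates would give an energy estimate, but the power models themselves are not reproduced here.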