Commit 2d01fdbaebd55e81ba9255884cba0ad668ad2311
1 parent
38e0d558
draft: camera ready submission
Showing 6 changed files with 50 additions and 17 deletions
inefficiency.tex
| ... | ... | @@ -151,14 +151,36 @@ We propose two methods for computing $E_{min}$: |
| 151 | 151 | |
| 152 | 152 | \end{itemize} |
| 153 | 153 | |
| 154 | -We are working towards designing efficient energy prediction models for CPU, | |
| 155 | -memory and network components. | |
| 156 | -% | |
| 157 | -Our models consider cross-component interactions on performance and energy | |
| 158 | -consumption. | |
| 159 | -% | |
| 160 | -In this work we demonstrate how to use inefficiency, deferring predicting and | |
| 161 | -optimizing $E_{min}$ to future work. | |
| 154 | +%We are working towards designing efficient energy prediction models for CPU, | |
| 155 | +%memory and network components. | |
| 156 | +% | |
| 157 | +%Our models consider cross-component interactions on performance and energy | |
| 158 | +%consumption. | |
| 159 | +% | |
| 160 | +%%%%%%%% MODEL %%%%%%%%%% | |
| 161 | +We designed efficient models to predict performance and energy consumption of | |
| 162 | +CPU and memory at various voltage and frequency settings for a given | |
| 163 | +application. We plan on using these models to estimate $E_{min}$ of a given set | |
| 164 | +of instructions. | |
| 165 | +%We envision a system capable of scaling voltage and frequency of CPU and only | |
| 166 | +%frequency of DRAM. | |
| 167 | +Our models consider cross-component interactions on performance and energy. | |
| 168 | +The performance model uses hardware performance counters to measure the amount of time | |
| 169 | +each component is $Busy$ completing work, $Idle$ stalled on the other | |
| 170 | +component, and $Waiting$ for more work. We designed a systematic methodology to | |
| 171 | +scale these states to estimate execution time of a given workload at different | |
| 172 | +voltage and frequency settings. In our model, the $Idle$ time of one component | |
| 173 | +depends on the settings of the second component. The $Busy$ time of each | |
| 174 | +component scales with its own frequency. However, the part of the $Busy$ time that | |
| 175 | +overlaps with the other component is constrained by the slower component. | |
| 176 | + | |
| 177 | +We combine predicted performance with the power models of CPU and memory | |
| 178 | +described in Section~\ref{subsec-energy-models} to estimate energy consumption | |
| 179 | +of CPU and memory. Our model has an average prediction error of 4\% across the SPEC | |
| 180 | +CPU2006 benchmarks, with a maximum error of 10\% except for $gobmk$ (18\%) and $lbm$ | |
| 181 | +(24\%). In this work we demonstrate how to use inefficiency, deferring | |
| 182 | +optimization of $E_{min}$ prediction to future work. | |
| 183 | +%%%%% END OF MODEL %%%%%% | |
| 162 | 184 | |
| 163 | 185 | \subsection{Managing Inefficiency} |
| 164 | 186 | % |
| ... | ... | @@ -183,4 +205,4 @@ We leave building some of these algorithms into a system as future work. |
| 183 | 205 | % |
| 184 | 206 | In this paper, we characterize the optimal performance point under different |
| 185 | 207 | inefficiency constraints and illustrate that the stability of these points |
| 186 | -have implications for future algorithms. | |
| 208 | +has implications for future algorithms. | ... | ... |
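Note on the performance model added in this hunk (inefficiency.tex, new lines 168-175): the scaling rules it describes can be illustrated with a small sketch. This is only one possible reading of the prose, not the authors' implementation. The names `Sample` and `scale_time`, the split of $Busy$ time into overlapped and non-overlapped parts, and the assumption that every state scales linearly with the inverse of frequency are all hypothetical. The $Waiting$ state is left out because the text does not say how it scales.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Times (seconds) measured for one component at the baseline setting.

    Hypothetical breakdown; the paper text does not define these fields.
    """
    busy_alone: float    # Busy time not overlapped with the other component
    busy_overlap: float  # Busy time overlapped with the other component
    idle: float          # Idle time, stalled on the other component


def scale_time(sample, f_own_base, f_own_new, f_other_base, f_other_new):
    """Estimate the component's time at new frequencies.

    Assumes linear scaling with 1/frequency, which is an illustration only.
    """
    own = f_own_base / f_own_new        # slowdown factor of this component
    other = f_other_base / f_other_new  # slowdown factor of the other component

    # Non-overlapped Busy time scales with the component's own frequency.
    busy_alone = sample.busy_alone * own
    # Overlapped Busy time is limited by whichever component slows down more.
    busy_overlap = sample.busy_overlap * max(own, other)
    # Idle (stall) time follows the other component's setting.
    idle = sample.idle * other
    return busy_alone + busy_overlap + idle


# Example: a CPU sample measured at 1000 MHz CPU and 800 MHz memory.
cpu = Sample(busy_alone=0.6, busy_overlap=0.2, idle=0.2)
half_cpu = scale_time(cpu, 1000, 500, 800, 800)   # halve the CPU frequency
half_mem = scale_time(cpu, 1000, 1000, 800, 400)  # halve the memory frequency
```

Under these assumptions, halving the CPU frequency stretches both Busy parts (about 1.8 s in total), while halving the memory frequency stretches only the overlapped Busy part and the Idle part (about 1.4 s), which matches the cross-component dependence described in the added text.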
introduction.tex
| ... | ... | @@ -32,7 +32,7 @@ energy constraints. |
| 32 | 32 | Our work represents two advances over previous efforts. |
| 33 | 33 | % |
| 34 | 34 | First, while previous works have explored energy minimizations using DVFS |
| 35 | -under performance constraints focusing on reducing slack, we are the first to | |
| 35 | +under performance constraints focusing on reducing slack~\cite{deng2012coscale}, we are the first to | |
| 36 | 36 | study the potential DVFS settings under an energy constraint. |
| 37 | 37 | % |
| 38 | 38 | Specifying performance constraints for servers is appropriate, since they are |
| ... | ... | @@ -86,7 +86,7 @@ management algorithms. |
| 86 | 86 | % |
| 87 | 87 | \end{enumerate} |
| 88 | 88 | |
| 89 | -We use the \texttt{gem5} simulator, the Android smartphone platform and Linux | |
| 89 | +We use the \texttt{Gem5} simulator, the Android smartphone platform and Linux | |
| 90 | 90 | kernel, and an empirical power model to (1) measure the inefficiency of |
| 91 | 91 | several applications for a wide range of frequency settings, (2) compute |
| 92 | 92 | performance clusters, and (3) study how they evolve. | ... | ... |
optimal_performance.tex
| ... | ... | @@ -61,7 +61,7 @@ simulation noise, the algorithm selects the settings with highest CPU (first) |
| 61 | 61 | and memory frequency as this setting is bound to have highest performance among |
| 62 | 62 | the other possibilities. |
| 63 | 63 | |
| 64 | -Figure~\ref{gobmk-optimal} plots the optimal settings for Gobmk for all | |
| 64 | +Figure~\ref{gobmk-optimal} plots the optimal settings for $gobmk$ for all | |
| 65 | 65 | benchmark samples (each of length 10 million instructions) across multiple |
| 66 | 66 | inefficiency constraints. At low inefficiencies, the optimal settings follow |
| 67 | 67 | the trends in CPI (cycles per instruction) and MPKI (misses per thousand | ... | ... |
paper.tex
| ... | ... | @@ -46,7 +46,17 @@ Geoffrey Challen, Mark Hempstead} |
| 46 | 46 | } |
| 47 | 47 | |
| 48 | 48 | \else |
| 49 | -\author{\IEEEauthorblockN{Paper \thepapernumber}\vspace*{-0.1in}} | |
| 49 | +%\author{\IEEEauthorblockN{Paper \thepapernumber}\vspace*{-0.1in}} | |
| 50 | + | |
| 51 | +\author{% | |
| 52 | + \IEEEauthorblockN{Rizwana Begum, David Werner and Mark Hempstead} | |
| 53 | + \IEEEauthorblockA{Drexel University\\ | |
| 54 | + {\rm \tt{\{rb639,daw77,mhempstead\}@drexel.edu}}} | |
| 55 | + \and | |
| 56 | + \IEEEauthorblockN{Guru Prasad and Geoffrey Challen} | |
| 57 | + \IEEEauthorblockA{University at Buffalo\\ | |
| 58 | + {\rm \tt \{gurupras,challen\}@buffalo.edu}} | |
| 59 | +} | |
| 50 | 60 | |
| 51 | 61 | \hypersetup{ |
| 52 | 62 | pdfinfo={ | ... | ... |
performance_clusters.tex
| ... | ... | @@ -178,7 +178,7 @@ constant. |
| 178 | 178 | %Note that |
| 179 | 180 | Our algorithm is not practical for real systems because it knows the characteristics of the
| 180 | 180 | future samples and their performance clusters in the beginning of a stable |
| 181 | -region.% (and therefore is impractical to implement in real systems). | |
| 181 | +region. % (and therefore is impractical to implement in real systems). | |
| 182 | 182 | We are |
| 183 | 183 | currently designing algorithms that are capable of tuning the system while |
| 184 | 184 | running the application as future work. In Section~\ref{sec-algo-implications}, we | ... | ... |
system_methodology.tex
| ... | ... | @@ -71,6 +71,7 @@ and degrade performance simultaneously.} |
| 71 | 71 | |
| 72 | 72 | |
| 73 | 73 | \subsection{Energy Models} |
| 74 | +\label{subsec-energy-models} | |
| 74 | 75 | We developed energy models for the CPU and DRAM for our studies. Gem5 comes |
| 75 | 76 | with the energy models for various DRAM chipsets. The |
| 76 | 77 | DRAMPower~\cite{drampower-tool} model is integrated into Gem5 and computes the |
| ... | ... | @@ -79,7 +80,7 @@ Gem5 lacks a model for CPU energy consumption. We developed a processor power |
| 79 | 80 | model based on empirical measurements of a PandaBoard~\cite{pandaboard-url} |
| 80 | 81 | evaluation board. The board includes an OMAP4430~chipset with a Cortex~A9
| 81 | 82 | processor; this chipset is used in the mobile platform we want to emulate, the |
| 82 | -Samsung Nexus S. We ran microbenchmarks designed to stress the Pandaboard to | |
| 83 | +Samsung Nexus S. We ran microbenchmarks designed to stress the PandaBoard to | |
| 83 | 84 | its full utilization and measured power consumed using an Agilent~34411A |
| 84 | 85 | multimeter. Because of the limitations of the platform, we could only measure |
| 85 | 86 | peak dynamic power. Therefore to model different voltage levels we scaled it |
| ... | ... | @@ -119,7 +120,7 @@ purposes, we have configured the CPU clock domain frequency to have a range of |
| 119 | 120 | |
| 120 | 121 | For the memory system, we simulated an LPDDR3 single channel, one rank memory access using an open-page
| 121 | 122 | policy. Timing and current parameters for LPDDR3 are configured as specified in |
| 122 | -micron data sheet~\cite{micronspec-url}. Memory clock domain is configured with a | |
| 123 | +Micron data sheet~\cite{micronspec-url}. Memory clock domain is configured with a | |
| 123 | 124 | frequency range of 200MHz to 800MHz. As mentioned earlier, we did not scale memory |
| 124 | 125 | voltage. The power supplies---VDD and VDD2---for LPDDR3 are fixed at 1.8V and 1.2V respectively. |
| 125 | 126 | |
| ... | ... | @@ -142,7 +143,7 @@ finer frequency steps was difficult as it would have resulted in more than |
| 142 | 143 | hours. |
| 143 | 144 | |
| 144 | 145 | We collected samples of a fixed amount of work so that each sample would |
| 145 | -represent the same work even across different frequencies. In gem5, we collectd | |
| 146 | +represent the same work even across different frequencies. In Gem5, we collected | |
| 146 | 147 | performance and energy consumption data every 10~million user mode |
| 147 | 148 | instructions. |
| 148 | 149 | %this fixed sample of work makes . | ... | ... |