\section{System and Methodology}
\label{sec-sys-methodology}
Energy management algorithms must tune the underlying hardware components to
keep the system within a given inefficiency budget. Hardware components
provide multiple knobs that can be tuned to trade off performance for energy
savings. For example, the energy consumed by the CPU can be managed by tuning
its frequency and voltage.
Recent
research~\cite{david2011memory,deng2011memscale} has shown that DRAM frequency scaling
also provides performance and energy trade-offs. 

In this work, we scale frequency and voltage for the CPU and scale only frequency for memory.
Dynamic Frequency Scaling (DFS) for memory has emerged as a means to trade off
performance for energy savings.
As no current hardware systems support memory frequency scaling,
we resort to Gem5~\cite{Binkert:gem5}, a cycle-accurate full-system simulator,
to perform our studies.

\subsection{System Overview}
\begin{figure}[t]
\centering
    \includegraphics[width=0.75\columnwidth]{./figures/plots/systemBlockDiagram.pdf}
    \caption{\textbf{System Block Diagram}: Blocks that we added or
    significantly modified relative to the original Gem5 implementation are shaded.}
    \label{fig-system-block-diag}
\end{figure}

Current Gem5 versions provide the infrastructure necessary to change CPU
frequency and voltage; we extended Gem5's DVFS support to incorporate memory
frequency scaling. As shown in Figure~\ref{fig-system-block-diag}, Gem5
provides a DVFS controller device that exposes an interface through which the
OS can control frequency at runtime. We developed a memory frequency governor
modeled after the existing Linux CPU frequency governors.
The blocks that we added or significantly modified from Gem5's original
implementation are shaded in Figure~\ref{fig-system-block-diag}.
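
To make the governor's role concrete, the following is a minimal Python sketch
of a userspace-style memory frequency governor; the sysfs-like paths and helper
names are illustrative assumptions, not the actual Gem5 or Linux interface
(Linux's \texttt{userspace} CPU governor analogously pins the frequency
requested from user level).
\begin{verbatim}
# Minimal sketch of a userspace-style memory frequency governor.
# The sysfs-like paths and helper names are illustrative
# assumptions, not the actual Gem5 or Linux interface.
AVAIL = "/sys/devices/system/memory/memfreq/available_frequencies"
SET   = "/sys/devices/system/memory/memfreq/scaling_setspeed"

def available_frequencies():
    with open(AVAIL) as f:
        return sorted(int(tok) for tok in f.read().split())

def set_memory_frequency(freq_hz):
    # Clamp the request to the nearest supported step (an
    # illustrative policy choice for this sketch).
    target = min(available_frequencies(),
                 key=lambda s: abs(s - freq_hz))
    with open(SET, "w") as f:
        f.write(str(target))
    return target

# Example: pin memory to 400 MHz before launching a benchmark.
set_memory_frequency(400_000_000)
\end{verbatim}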

\begin{figure*}[t]
    \begin{subfigure}[t]{\textwidth}
	    \centering
    	\includegraphics[width=\columnwidth,height=0.15\paperheight]{./figures/plots/496/speedup_inefficiency/heatmap_inefficiency.pdf}
        \label{heatmap-ineff}
    \end{subfigure}%
    \vspace{-1.2em}
    \newline
    \begin{subfigure}[t]{\textwidth}
	    \centering
        \includegraphics[width=\columnwidth,height=0.15\paperheight]{./figures/plots/496/speedup_inefficiency/heatmap_speedup.pdf}
        \label{heatmap-speedup}
    \end{subfigure}%
\caption{\textbf{Inefficiency vs. Speedup for Multiple Applications:} In
general, performance improves with increasing inefficiency budgets. A poorly
designed algorithm may select frequency settings that simultaneously waste
energy and degrade performance.}
\label{heatmaps}
\end{figure*}


\subsection{Energy Models}
We developed energy models for the CPU and DRAM for our studies. Gem5 comes
with energy models for various DRAM chipsets. The
DRAMPower~\cite{drampower-tool} model is integrated into Gem5 and computes the
memory energy consumption periodically during benchmark execution. However,
Gem5 lacks a model for CPU energy consumption. We therefore developed a
processor power model based on empirical measurements of a
PandaBoard~\cite{pandaboard-url} evaluation board. The board includes an
OMAP4430 chipset with a Cortex~A9 processor; this chipset is used in the
mobile platform we want to emulate, the Samsung Nexus~S. We ran
microbenchmarks designed to stress the PandaBoard to full utilization and
measured the power consumed using an Agilent~34411A multimeter. Because of
platform limitations, we could only measure peak dynamic power. Therefore, to
model other operating points, we scaled the measured peak quadratically with
voltage and linearly with frequency $(P{\propto}V^{2}f)$. Our peak dynamic
power agrees with the numbers reported by previous
work~\cite{poweragile-hotos11} and with the datasheets.
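Concretely, letting $P_{peak}$ denote the dynamic power measured at the
maximum voltage $V_{max}$ and frequency $f_{max}$, the $P{\propto}V^{2}f$
relation scales it to an arbitrary operating point $(V, f)$ as
\[
P_{dyn}(V, f) = P_{peak} \left(\frac{V}{V_{max}}\right)^{2} \frac{f}{f_{max}}.
\]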

We split CPU power consumption into three categories: dynamic power,
background power, and leakage power. Background power is consumed by idle
units when the processor is not computing but, unlike leakage power, it scales
with clock frequency. We measure background power as the difference between
the CPU's power consumption in its powered-on idle state and in deep sleep
mode (not clocked). Because background power scales with the clock, we scale
it in the same manner as dynamic power. Leakage power comprises up to
30\% of microprocessor peak power consumption~\cite{power7} and is linearly
proportional to supply voltage~\cite{leakage-islped02}.
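
To summarize the model, the following Python sketch combines the three
components; the numeric constants are illustrative placeholders rather than
our measured PandaBoard values.
\begin{verbatim}
# Sketch of the three-component CPU power model; the constants
# below are illustrative placeholders, not measured values.
V_MAX, F_MAX = 1.25, 1000e6   # peak voltage (V) and frequency (Hz)
P_DYN_PEAK  = 0.60            # peak dynamic power (W), placeholder
P_BG_PEAK   = 0.20            # background power at peak V/f (W)
P_LEAK_PEAK = 0.30            # leakage at peak voltage (W)

def cpu_power(v, f, activity):
    """Total CPU power at voltage v (V), frequency f (Hz), and
    activity in [0, 1] (fraction of busy cycles)."""
    scale_vf   = (v / V_MAX) ** 2 * (f / F_MAX)
    dynamic    = P_DYN_PEAK * scale_vf * activity  # P ~ V^2 * f
    background = P_BG_PEAK * scale_vf   # clocked: same scaling
    leakage    = P_LEAK_PEAK * (v / V_MAX)  # linear in voltage
    return dynamic + background + leakage
\end{verbatim}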

We integrated our CPU power model into Gem5.

\subsection{Experimental Methodology}
Our simulation infrastructure is based on Android~4.1.1 ``Jelly Bean'' running
on the Gem5 full-system simulator. We model a single-core, out-of-order
Cortex-A9 CPU with an issue width of 8, 64~KB L1 caches with an access latency
of 2 core cycles, and a unified 2~MB L2 cache with a hit latency of 12 core
cycles. The CPU and caches operate in the same clock domain. For our purposes,
we configured the CPU clock domain with a frequency range of 100--1000~MHz and
a peak voltage of 1.25~V.
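
This clock-domain setup corresponds roughly to the Gem5 Python configuration
sketch below; the frequency and voltage lists are abbreviated and
illustrative, and exact parameter names may vary across Gem5 versions.
\begin{verbatim}
# Sketch of the CPU DVFS configuration in Gem5's Python config
# (abbreviated, illustrative frequency/voltage pairs).
system.cpu_voltage_domain = VoltageDomain(
    voltage=['1.25V', '1.1V', '0.95V', '0.8V'])
system.cpu_clk_domain = SrcClockDomain(
    clock=['1000MHz', '700MHz', '400MHz', '100MHz'],
    voltage_domain=system.cpu_voltage_domain,
    domain_id=0)
system.dvfs_handler.domains = [system.cpu_clk_domain]
system.dvfs_handler.enable = True
\end{verbatim}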

For the memory system, we simulated a single-channel, single-rank LPDDR3
memory using an open-page policy. Timing and current parameters for LPDDR3 are
configured as specified in the Micron data sheet~\cite{micronspec-url}. The
memory clock domain is configured with a frequency range of 200--800~MHz. As
mentioned earlier, we did not scale memory voltage: the LPDDR3 power
supplies, VDD1 and VDD2, are fixed at 1.8~V and 1.2~V, respectively.
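
Since stock Gem5 lacks a memory frequency domain, our extension adds one; the
sketch below mirrors the CPU-side interface above, and its names are
illustrative assumptions rather than the exact interface of our extension.
\begin{verbatim}
# Sketch of the memory clock domain added by our Gem5 extension;
# names mirror the CPU-side interface and are illustrative.
system.mem_clk_domain = SrcClockDomain(
    clock=['800MHz', '600MHz', '400MHz', '200MHz'],
    voltage_domain=VoltageDomain(voltage='1.2V'),  # fixed voltage
    domain_id=1)
system.dvfs_handler.domains.append(system.mem_clk_domain)
\end{verbatim}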

We first simulated 12 integer and 9 floating-point SPEC CPU2006
benchmarks~\cite{henning2006spec}, with each benchmark either running to
completion or up to 2~billion instructions. We booted the system and then set
the CPU and memory frequencies using userspace frequency governors before
starting the benchmark. We ran 70 simulations for each benchmark, covering all
combinations of 10 CPU and 7 memory frequency steps with a step size of
100~MHz. To study the finer details of workload phases, we then ran a total of
496 simulations at a finer step granularity of 30~MHz for the CPU and 40~MHz
for memory, for selected benchmarks with interesting and unique phases.
Due to limited resources and time, running simulations for all benchmarks at
the finer frequency steps was impractical: it would have required more than
10,000 simulations, each taking between 4 and 12 hours.
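
For reference, the two frequency grids can be enumerated as follows; the
counts match the 70 and 496 simulations per benchmark quoted above.
\begin{verbatim}
# Enumerate the coarse and fine CPU/memory frequency grids (MHz).
coarse = [(c, m) for c in range(100, 1001, 100)  # 10 CPU steps
                 for m in range(200, 801, 100)]  #  7 memory steps
fine   = [(c, m) for c in range(100, 1001, 30)   # 31 CPU steps
                 for m in range(200, 801, 40)]   # 16 memory steps
assert len(coarse) == 70 and len(fine) == 496
\end{verbatim}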

We collected samples of a fixed amount of work so that each sample represents
the same work even across different frequencies. In Gem5, we collected
performance and energy consumption data every 10~million user-mode
instructions.
Gem5 provides a mechanism to distinguish between user-mode and kernel-mode
instructions. We used this feature to exclude periodic OS traffic and enable a
fair comparison across simulations at different CPU and memory frequencies. We
used the collected performance and energy data to study the impact of workload
dynamics on the stability of the CPU and memory frequency settings that
deliver the best performance under a given inefficiency budget. Note that all
our studies use \textit{measured} performance and power data from the
simulations; we do not \textit{predict} performance or energy.
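
To illustrate why fixed-work sampling enables this comparison, the sketch
below pairs samples from two runs: sample $i$ covers the same
10-million-instruction slice of the program in both runs, so per-sample ratios
are directly meaningful. The input format is a hypothetical placeholder for
parsed Gem5 statistics.
\begin{verbatim}
# Sketch: compare two frequency settings sample-by-sample. Each
# run is a list of (seconds, joules) pairs, one per 10M user-mode
# instructions, e.g. parsed from Gem5 stats dumps (hypothetical
# input format).
def compare_runs(baseline, candidate):
    for i, ((t0, e0), (t1, e1)) in enumerate(
            zip(baseline, candidate)):
        speedup = t0 / t1  # > 1: candidate did the work faster
        energy  = e1 / e0  # > 1: candidate spent more energy
        yield i, speedup, energy
\end{verbatim}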

Although the individual energy-performance trade-offs of DVFS for the CPU and
DFS for memory have been studied in the past, the trade-off resulting from
the cross-component interaction of these two components has not been
characterized. CoScale~\cite{deng2012coscale} pointed out that the interplay
between the performance and energy consumption of these two components is
complex, and presented a heuristic that attempts to pick the optimal operating
point. However, it did not measure or characterize the larger space of
system-level performance and energy trade-offs across the various CPU and
memory frequency settings.
In the next section, we study how the performance and inefficiency of
applications vary with CPU and memory frequencies.