\documentclass[conference]{IEEEtran}
\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{mathtools}
\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\begin{document}
\title{Modeling Impact of Execution Strategies on Resource Utilization}
\author{\IEEEauthorblockN{Alexey Poyda\IEEEauthorrefmark{1}, Mikhail Titov\IEEEauthorrefmark{1}, Alexei Klimentov\IEEEauthorrefmark{2}, Jack Wells\IEEEauthorrefmark{3}, Sarp Oral\IEEEauthorrefmark{3}, Kaushik De\IEEEauthorrefmark{4}, Danila Oleynik\IEEEauthorrefmark{4}\\ and Shantenu Jha\IEEEauthorrefmark{2}\IEEEauthorrefmark{5}}
\IEEEauthorblockA{\IEEEauthorrefmark{1}National Research Centre ``Kurchatov Institute'', Moscow, Russia}
\IEEEauthorblockA{\IEEEauthorrefmark{2}Brookhaven National Laboratory, Upton, NY, USA}
\IEEEauthorblockA{\IEEEauthorrefmark{3}Oak Ridge National Laboratory, Oak Ridge, TN, USA}
\IEEEauthorblockA{\IEEEauthorrefmark{4}University of Texas at Arlington, Arlington, TX, USA}
\IEEEauthorblockA{\IEEEauthorrefmark{5}Rutgers University, Piscataway, NJ, USA}}
\maketitle
\begin{abstract}
The analysis of the hundreds of petabytes of raw and derived HEP (High Energy Physics) data will necessitate exascale computing. In addition to their unprecedented volume, these data are distributed over hundreds of computing centers. In response to these application requirements, to performance requirements that call for parallel processing, and as a consequence of technology trends, there has been an increase in the uptake of supercomputers by HEP projects.
The increased prevalence of supercomputers has important implications for the execution strategy employed. Because each resource is now a significant workhorse, correct resource selection becomes more important and more sensitive with respect to achieving high utilization and good response time. As the types of HEP tasks that use high-throughput and high-performance computing resources become more varied (e.g., reconstruction, simulation, digitization), better models for generalized resource selection over a wide range of heterogeneous tasks are needed in order to respond to different heterogeneous workloads. Another important consideration is the selection of resources that are likely to facilitate the utilization of the available hours within a meaningful time window. For example, there is no point in selecting resources that might enable faster execution but will not support the required number of core-hours in the time window of interest. Thus, any accurate resource selection strategy must also consider the implications of the load of the chosen supercomputer.
In this short paper, we study the interplay between the load on a supercomputer and the tasks placed on it. To do so we introduce the concept of an execution strategy, defined as the set of parameter values that uniquely define a group of jobs to be executed: the number of required nodes per job, the required execution time (wall-time) per job, and the job generation rate. The aim is to find execution strategies that maximize the probability of utilizing a certain allocation; equivalently, we seek the execution strategy that optimizes the utilization of a given number of core-hours on a resource. The resulting execution strategy is applied to the upcoming jobs over a particular time period and is therefore called static for that period. Our working hypothesis is that in most cases a well-defined static strategy gives a higher probability of successful application than, for example, a random dynamic strategy.
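For concreteness, such an execution strategy could be represented by a structure like the following minimal sketch; the names and example values are illustrative assumptions, not our implementation.
\begin{verbatim}
# Sketch: an execution strategy as the parameters
# that define a group of jobs (names are assumed).
from dataclasses import dataclass

@dataclass(frozen=True)
class ExecutionStrategy:
    nodes_per_job: int          # required nodes per job
    walltime_per_job: float     # wall-time per job, hours
    job_generation_rate: float  # jobs submitted per hour

    def core_hours_per_job(self, cores_per_node):
        """Upper bound on core-hours used by one job."""
        return (self.nodes_per_job * cores_per_node
                * self.walltime_per_job)

# Example: 125-node, 2-hour jobs at 4 jobs per hour.
strategy = ExecutionStrategy(nodes_per_job=125,
                             walltime_per_job=2.0,
                             job_generation_rate=4.0)
print(strategy.core_hours_per_job(cores_per_node=16))
\end{verbatim}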
Our proposed solution includes: i) a quantitative model to estimate the probability of a given number of core-hours (i.e., an allocation) being utilized; and ii) a simulator that emulates the workload of a supercomputer and produces (synthetic) job traces. Both the quantitative model and the simulator are statistical in nature and are reliable for large job counts over long durations.
The quantitative model is trained on historical data from previous utilization of allocation time. It is represented by an equation that calculates the probability that the utilization $U$ during the time interval $T_0$ reaches or exceeds a predefined value $U_0$ (see equation \ref{equation_1}), where $f(x, \mu, \sigma^2)$ is the probability density function of the normal distribution $N(\mu, \sigma^2)$, $\mu$ and $\sigma^2$ are, respectively, the expected value and variance of the random variable describing the queue waiting time of a job, and $\mu_U$ and $\sigma_{U}^2$ are the corresponding quantities for the random variable describing the utilization of a single job.
\begin{equation}
\label{equation_1}
\begin{multlined}
P(U > U_0) = \sum\limits_{n=100}^{\infty}
\bigg[ \int_{U_0}^{\infty}f(x, n\mu_{U}, n\sigma_{U}^2)dx \ \times \\
\bigg( \int_{-\infty}^{T_0}f(x, n\mu, n\sigma^2)dx \ - \\
\int_{-\infty}^{T_0}f(x, (n+1)\mu, (n+1)\sigma^2)dx \bigg) \bigg]
\end{multlined}
\end{equation}
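As an illustrative sketch only (not the implementation used in this work), the probability in equation \ref{equation_1} can be evaluated numerically by truncating the infinite sum; the SciPy-based code below and all parameter values in it are assumptions for illustration.
\begin{verbatim}
# Sketch: numerical evaluation of Eq. (1); truncation
# bound and parameter values are illustrative only.
from scipy.stats import norm

def prob_utilization(U0, T0, mu, sigma2, mu_U, sigma2_U,
                     n_min=100, n_max=5000):
    """Approximate P(U > U0) within window T0."""
    total = 0.0
    for n in range(n_min, n_max + 1):
        # P(total utilization of n jobs exceeds U0)
        p_util = norm.sf(U0, loc=n * mu_U,
                         scale=(n * sigma2_U) ** 0.5)
        # P(exactly n jobs fit into the window T0),
        # written as a difference of waiting-time CDFs
        p_n = (norm.cdf(T0, loc=n * mu,
                        scale=(n * sigma2) ** 0.5)
               - norm.cdf(T0, loc=(n + 1) * mu,
                          scale=((n + 1) * sigma2) ** 0.5))
        total += p_util * p_n
    return total

# Placeholder values: 2 h mean queue wait,
# 500 core-hours mean utilization per job.
print(prob_utilization(U0=1.0e6, T0=720.0,
                       mu=2.0, sigma2=1.0,
                       mu_U=500.0, sigma2_U=250.0 ** 2))
\end{verbatim}
The summation starts at $n=100$ as in equation \ref{equation_1}; in practice the upper bound of the truncation is chosen so that the remaining terms are negligible.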
Experiments conducted on historical data from the supercomputer Titan showed that it is possible to estimate the optimal time duration for complete utilization of allocated resources.
As part of the proposed solution, we developed a simulator of a supercomputer's workload. It is based on queueing theory and allows the requirements and restrictions of a particular supercomputer (e.g., Titan) to be specified. It is used for model validation and adjustment: for a given partial workload it simulates the load on a resource (e.g., Titan) and produces traces of job execution according to a defined job state model (generated, on-hold, pending, starting, executing, finished).
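A minimal sketch of such a simulator is shown below; it implements only first-come-first-served scheduling on a fixed node pool and omits the full job state model and Titan-specific policies, and all names and values in it are illustrative assumptions.
\begin{verbatim}
# Sketch: FCFS scheduling of jobs on a fixed node pool;
# a strongly simplified stand-in for the actual simulator.
import heapq

def simulate(jobs, total_nodes):
    """jobs: (submit_time, nodes, walltime) tuples;
    returns per-job trace records."""
    running = []          # heap of (end_time, nodes)
    free = total_nodes
    trace = []
    for submit, nodes, wall in sorted(jobs):
        assert nodes <= total_nodes, "job exceeds machine"
        clock = submit
        # release finished jobs; if nodes are still
        # lacking, wait for the earliest running job
        while running and (running[0][0] <= clock
                           or free < nodes):
            end, n = heapq.heappop(running)
            clock = max(clock, end)
            free += n
        start = clock
        free -= nodes
        heapq.heappush(running, (start + wall, nodes))
        trace.append({"submit": submit, "start": start,
                      "end": start + wall, "nodes": nodes,
                      "wait": start - submit})
    return trace

print(simulate([(0, 10, 5), (1, 20, 3), (2, 18, 4)],
               total_nodes=24))
\end{verbatim}
The actual simulator additionally tracks the full job state model listed above and the scheduling constraints of the target machine.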
Our research will provide tools that estimate the potential utilization of allocated resources on any supercomputer, given its workload. These tools can be used to adjust job parameters in order to regulate and improve the probability of utilizing a given allocation for a given project. Although the model and simulator are preliminary and require further tuning to understand their accuracy and sensitivity (to initial conditions, training duration, and workload types), they are already being applied successfully to large-scale allocations. We believe the importance of job-parameter shaping (tuning) will increase significantly as the number of core-hours consumed by a project increases (and gradually approaches the exascale). Further, this early work will be extended to consider different kinds of workflows as well as different types of workloads on heterogeneous resources.
\end{abstract}
\end{document}