Remote GPU virtualization offers an alluring means to increase the utilization of the GPUs installed in a cluster, which can potentially yield a faster amortization of the total cost of ownership (TCO). Concretely, GPU virtualization logically decouples the GPUs in the cluster from the nodes they are located in, opening a path to share the accelerators among all the applications that request GPGPU services, independently of whether the node(s) these applications are mapped to are equipped with a GPU or not. In this manner the number of accelerators can be reduced, and their utilization rate can be significantly improved.
SLURM can use a generic resource plug-in (GRes) to manage GPUs. With this solution, hardware accelerators such as GPUs can only be accessed by the job that is executing on the node to which the GPU is attached. This is a serious constraint for remote GPU virtualization technologies, which aim to provide completely user-transparent access to all GPUs in the cluster, independently of the specific locations of the application node and the GPU node.
In this work we introduce a new type of resource in SLURM, the remote GPU (rGPU), in order to gain access from any application node to any GPU node in the cluster using rCUDA as the remote GPU virtualization solution.
With this new resource, users can access all the GPUs needed for their jobs, as SLURM schedules the tasks taking into account all the GPUs available in the whole cluster. In other words, introducing a GPU-virtualization-aware mechanism into SLURM allows applications to execute CUDA kernels on any GPU, independently of its location.
The modifications needed to extend the SLURM-rCUDA suites with the sought-after functionality are the following:
- New attributes were added to several data structures in SLURM in order to maintain information about the GPUs that are required by jobs, partitions and nodes.
- New options for submitting jobs that need rGPU resources were included. The new options allow users to configure this kind of job more precisely.
- We introduced several new fields in the SLURM configuration file that hold the default options for submitting rCUDA jobs.
- The GRes module of SLURM, which manages the allocation and deallocation of generic resources such as GPUs, was modified so that the GPUs in the cluster are accessible from all nodes, which implies that GPUs are shared among the nodes.
- Several new SLURM plug-ins were implemented. A GRes plug-in "gres/rgpu" declares a new generic resource in the system: the remote GPUs. Three select plug-ins are responsible for rGPU job resource selection and scheduling. The code of these plug-ins is based on the "select/cons_res" plug-in, and therefore similar behaviour can be expected from them.
- Additional fields were introduced in the RPC packages in order to transfer the rGPU information used by SLURM to schedule the jobs.
- Finally, two rCUDA environment variables had to be set during scheduling in order to enable the use of the rCUDA software:
  - RCUDA_DEVICE_COUNT, used by an rCUDA client to learn how many GPUs exist.
  - RCUDA_DEVICE_X, used by an rCUDA client to learn the IP address of the node where rGPU number X is installed.
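As an illustration of the last item, the following minimal Python sketch shows how the two variables (written with underscores, RCUDA_DEVICE_COUNT and RCUDA_DEVICE_X) could be derived from a list of allocated remote GPUs. The helper name `rcuda_environment`, the sample addresses, and the `address:gpu_index` value layout are assumptions made for this example; the actual code inside the modified SLURM is not shown in this abstract.

```python
from typing import Dict, List, Tuple


def rcuda_environment(allocated: List[Tuple[str, int]]) -> Dict[str, str]:
    """Build the rCUDA client variables for (server_address, gpu_index) pairs.

    Hypothetical helper: the "address:gpu_index" value layout is an
    assumption of this sketch, not taken from the abstract.
    """
    # The client first reads how many remote GPUs it has been granted.
    env = {"RCUDA_DEVICE_COUNT": str(len(allocated))}
    # One RCUDA_DEVICE_X entry per granted GPU, numbered from 0.
    for slot, (server, gpu_index) in enumerate(allocated):
        env[f"RCUDA_DEVICE_{slot}"] = f"{server}:{gpu_index}"
    return env


if __name__ == "__main__":
    # Made-up addresses, for illustration only.
    env = rcuda_environment([("192.168.0.11", 0), ("192.168.0.12", 1)])
    for name, value in env.items():
        print(f"{name}={value}")
```

A scheduler would export these variables into the job environment before launching the application, so that the rCUDA client library can locate its remote GPUs without any change to the application itself.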
After these changes, SLURM allows the user to submit jobs to the queue in three different working modes:
- Original (SLURM): The behavior of SLURM is analogous to that of version 14.11.0.
- Exclusive (rCUDAex): SLURM decouples GPUs from nodes, but they remain accessible only to one job at a time.
- Shared (rCUDAsh): Analogous to the previous mode except that now GPUs can be shared by several jobs. This mode is automatically enabled whenever a certain amount of GPU memory is requested.
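The choice among the three modes can be pictured with a short Python sketch. The `JobRequest` fields `rgpus` and `rgpu_mem_mb` are hypothetical names introduced for this example, and treating any explicit memory request as the trigger for the shared mode is an assumption about how "a certain amount of GPU memory is requested" is detected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobRequest:
    """Hypothetical view of a job submission (field names are assumptions)."""
    rgpus: int = 0                      # number of remote GPUs requested
    rgpu_mem_mb: Optional[int] = None   # GPU memory requested per rGPU, if any


def working_mode(job: JobRequest) -> str:
    """Return the working mode implied by the job request."""
    if job.rgpus == 0:
        # No remote GPUs requested: behave like unmodified SLURM.
        return "SLURM"
    if job.rgpu_mem_mb is not None:
        # A GPU memory amount was given, so the GPU may be shared.
        return "rCUDAsh"
    # Remote GPUs without a memory amount: one job per GPU at a time.
    return "rCUDAex"


if __name__ == "__main__":
    for job in (JobRequest(), JobRequest(rgpus=2), JobRequest(rgpus=1, rgpu_mem_mb=2048)):
        print(job, "->", working_mode(job))
```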
We have also performed an extensive evaluation to confirm that the original behaviour of SLURM has not been modified, to test the new functionality of our SLURM version, and to assess the real possibilities of reducing the number of GPUs installed in a cluster, together with the impact of this reduction on performance.
Our current version of SLURM makes scheduling decisions involving remote GPUs based on both the amount of GPU memory required by the job and the minimum CUDA capability version needed.
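To make the two criteria concrete, the sketch below filters candidate remote GPUs by free memory and CUDA capability. It is only an illustration under stated assumptions: the `RemoteGpu` record, the function `select_rgpus`, and the first-fit order are hypothetical and do not reproduce the select plug-ins described above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RemoteGpu:
    """Hypothetical record for one GPU known to the scheduler."""
    node: str
    index: int
    free_mem_mb: int
    cuda_capability: Tuple[int, int]  # (major, minor), e.g. (3, 5)


def select_rgpus(
    gpus: List[RemoteGpu],
    count: int,
    mem_mb: int,
    min_capability: Tuple[int, int],
) -> Optional[List[RemoteGpu]]:
    """First-fit choice of `count` GPUs that satisfy both criteria.

    Returns None when the request cannot be satisfied, in which case the
    job would remain queued.
    """
    if count <= 0:
        return []
    chosen: List[RemoteGpu] = []
    for gpu in gpus:
        enough_memory = gpu.free_mem_mb >= mem_mb
        # Tuples compare lexicographically, so (3, 5) >= (3, 0) holds.
        capable = gpu.cuda_capability >= min_capability
        if enough_memory and capable:
            chosen.append(gpu)
            if len(chosen) == count:
                return chosen
    return None


if __name__ == "__main__":
    # Made-up cluster state, for illustration only.
    cluster = [
        RemoteGpu("node1", 0, 1024, (2, 0)),
        RemoteGpu("node2", 0, 4096, (3, 5)),
        RemoteGpu("node3", 0, 6144, (3, 5)),
    ]
    print(select_rgpus(cluster, count=2, mem_mb=2048, min_capability=(3, 0)))
```

The location of the GPU plays no role in the filter, which mirrors the decoupling of GPUs from nodes; taking it into account would be one way to model the "network distance" mentioned as future work.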
As part of future work we plan to experiment with different scheduling algorithms, in order to take into account the actual computational intensity of the workload jobs, the GPU computational power, and the “network distance” between the application nodes and the remote GPU nodes to be assigned to a job.
Title: Managing Virtualized Remote GPUs with SLURM
Universitat Jaume I de Castelló (Sergio Iserte, Adrian Castelló, Rafael Mayo, Enrique S. Quintana-Ortí)
Universitat Politècnica de València (Federico Silla, Jose Duato)
Speaker : Sergio Iserte (Universitat Jaume I de Castelló)
Date : 6th February 2015
Location : Faculty of Chemistry (UB), C/ Martí i Franqués 1, 08028 Barcelona, Spain.