With the v0.12 release, the NVIDIA device plug-in framework began allowing time-sliced GPU sharing among CUDA workloads running in containers on Kubernetes. By time-multiplexing CUDA contexts, this mechanism aims to reduce GPU under-utilization and make it easier to scale applications. Such temporal concurrency was already possible before the plug-in's formal release thanks to a fork of the project.
NVIDIA GPUs automatically serialize compute kernels (i.e., the functions that run on the hardware) submitted from different CUDA contexts. Within a single process, the CUDA stream API provides a concurrency abstraction, but streams are not visible across process boundaries. As a result, running parallel tasks directly from separate processes (for example, multiple replicas of a server application) typically leaves GPU resources under-utilized.
For multi-process GPU workloads on workstations, NVIDIA provides four types of concurrency: CUDA Multi-Process Service (MPS), Multi-Instance GPU (MIG), vGPU, and time-slicing. On Kubernetes, however, the device plug-in advertises NVIDIA GPUs to the scheduler as discrete integer resources, so these devices cannot be oversubscribed. This creates a scalability bottleneck for HPC and ML architectures, particularly when several CUDA contexts (i.e., different applications) cannot share current GPUs efficiently. While the NVIDIA_VISIBLE_DEVICES environment variable can be set to "all" to expose every GPU to a container, this configuration sidesteps the resource accounting and can have unintended consequences for Ops teams.
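To make the integer resource model concrete, here is a minimal sketch of a pod spec requesting a GPU through the device plug-in (the pod and image names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-app
spec:
  containers:
  - name: cuda-app
    image: nvcr.io/nvidia/cuda:12.0.0-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1  # whole-GPU units only; fractional requests are rejected
```

Because `nvidia.com/gpu` is an integer resource, two such pods on a single-GPU node cannot both be scheduled, which is exactly the oversubscription limit described above.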
As Kubernetes evolved into the standard platform for scaling services, NVIDIA began integrating its native concurrency mechanisms into clusters through the device plug-in. The K8s device plug-in already supports multi-instance GPU concurrency for Ampere and later architectures (such as the A100). The most recent entry on the list offers temporal concurrency through the time-slicing API. MPS support for Volta and newer GPU architectures, on the other hand, has not yet been implemented by the plug-in team.
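As a sketch of the MIG support mentioned above: when the plug-in advertises MIG partitions as named resources, a container can request a specific profile instead of a whole GPU (the profile name below is an example and depends on how the GPU was partitioned):

```yaml
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1  # one 1g.5gb MIG slice of an A100, assuming that profile exists on the node
```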
Since a GPU can manage many CUDA contexts at once and serializes their execution, multiple contexts already run in a temporally concurrent fashion. Why, then, opt into the time-slicing API at all? First, scheduling contexts launched independently from separate host (CPU) processes carries a significant computational overhead. Second, knowing the number of CUDA clients in advance may enable higher-level optimizations.
The time-slicing API is particularly helpful for ML-serving applications, as businesses often run inference workloads on less expensive, lower-end GPUs. Because MPS and MIG are only available on GPUs starting with the Volta and Ampere architectures, respectively, common inference GPUs such as the NVIDIA T4 cannot use MIG on Kubernetes. Future libraries attempting to maximize accelerator utilization will therefore depend on the time-slicing API.
Enabling temporal concurrency is as simple as adding an extra section to the plug-in's configuration YAML. The following option, for instance, multiplies the number of advertised time-shared devices by a factor of 5, so that a node with four physical GPUs exposes 20 shareable devices to Kubernetes:
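A sketch of such a configuration, following the `timeSlicing` section documented for the NVIDIA device plug-in (the replica count here matches the factor-of-5 example):

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 5  # each physical GPU is advertised as 5 time-shared devices
```

With this in place, each pod still requests `nvidia.com/gpu: 1`, but up to five pods can land on the same physical GPU, sharing it through time-slicing.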
AMD maintains a separate K8s device plug-in repository for its ROCm APIs. The ROCm OpenCL API, for instance, supports hardware queues for concurrent kernel execution on a GPU workstation, yet Kubernetes imposes the same integer-resource restrictions there as well. Regardless of vendor, we may see efforts to standardize the Kubernetes platform's GPU-sharing techniques in the future.
The official documentation offers more detail on the concurrency mechanisms, while the plug-in documentation provides a thorough explanation of the time-slicing API. There is also a separate MPS-based Kubernetes vGPU implementation from AWS for Volta and Ampere architectures.