
How to Increase GPU Utilization in Kubernetes with NVIDIA MPS | by Michele Zanotti | Feb, 2023


Most workloads do not require the full memory and computing resources of each GPU. Therefore, sharing a GPU among multiple processes is essential to increase GPU utilization and reduce infrastructure costs.

In Kubernetes, this can be achieved by exposing a single GPU as multiple resources (i.e. slices) of a specific memory and compute size that can be requested by individual containers. By creating GPU slices that are only as large as each container strictly needs, you free up resources in the cluster. These resources can be used to schedule additional Pods, or they can allow you to reduce the number of nodes in the cluster. In either case, sharing GPUs among processes reduces infrastructure costs.

GPU support in Kubernetes is provided by the NVIDIA Kubernetes Device Plugin, which at the moment supports only two sharing strategies: time-slicing and Multi-Instance GPU (MIG). However, there is a third GPU sharing strategy that balances the advantages and disadvantages of time-slicing and MIG: Multi-Process Service (MPS). Although MPS is not supported by the NVIDIA Device Plugin, there is a way to use it in Kubernetes.

In this article, we will first examine the benefits and drawbacks of all three GPU sharing technologies, and then provide a step-by-step guide on how to use MPS in Kubernetes. Additionally, we present a solution for automating the management of MPS resources to optimize utilization and reduce operational costs: Dynamic MPS Partitioning.

There are three approaches for sharing GPUs:

  1. Time slicing
  2. Multi-instance GPU (MIG)
  3. Multi-Process Service (MPS)

Let’s go over these technologies before diving into the demo of Dynamic MPS Partitioning.

Time-slicing is a mechanism that allows workloads that land on oversubscribed GPUs to interleave with one another. Time-slicing leverages the GPU time-slicing scheduler, which executes multiple CUDA processes concurrently via temporal sharing.

When time-slicing is activated, the GPU shares its compute resources among the different processes in a fair-sharing manner by switching between processes at regular intervals of time. This generates a computing time overhead related to the continuous context switching, which translates into jitter and higher latency.

Time-slicing is supported by basically every GPU architecture and is the simplest solution for sharing a GPU in a Kubernetes cluster. However, constant switching among processes creates a computation time overhead. Also, time-slicing does not provide any level of memory isolation among the processes sharing a GPU, nor any memory allocation limits, which can lead to frequent Out-Of-Memory (OOM) errors.

If you want to use time-slicing in Kubernetes, all you have to do is edit the NVIDIA Device Plugin configuration. For example, you can apply the configuration below to a node with 2 GPUs. The device plugin running on that node will advertise 8 nvidia.com/gpu resources to Kubernetes, rather than 2. This allows each GPU to be shared by a maximum of 4 containers.

version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
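
How this configuration reaches the device plugin depends on how the plugin is deployed. As a sketch, assuming the plugin is managed by the NVIDIA GPU Operator installed in the gpu-operator namespace, the configuration can be stored in a ConfigMap and referenced from the ClusterPolicy (the ConfigMap name and file key below are illustrative):

# Save the configuration above as time-slicing-config.yaml, then:
kubectl create configmap time-slicing-config \
  -n gpu-operator \
  --from-file=any=time-slicing-config.yaml

# Point the GPU Operator's device plugin at the new ConfigMap
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'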

For more information about time-slicing in Kubernetes, refer to the NVIDIA GPU Operator documentation.

Multi-Instance GPU (MIG) is a technology available on NVIDIA Ampere and Hopper architectures that allows a GPU to be securely partitioned into up to seven separate GPU instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores.

The isolated GPU slices are called MIG devices, and they are named with a format that indicates the compute and memory resources of the device. For example, 2g.20gb corresponds to a GPU slice with two of the GPU’s compute units and 20 GB of memory.

MIG does not allow you to create GPU slices of arbitrary size and quantity, as each GPU model only supports a specific set of MIG profiles. This reduces the granularity with which you can partition the GPUs. Additionally, MIG devices must be created respecting certain placement rules, which further limits flexibility.

MIG is the GPU sharing approach that offers the highest level of isolation among processes. However, it lacks flexibility and it is compatible only with a few GPU architectures (Ampere and Hopper).

You can create and delete MIG devices manually with the nvidia-smi CLI or programmatically with NVML. The devices are then exposed as Kubernetes resources by the NVIDIA Device Plugin using different naming strategies. For instance, with the mixed strategy, the device 1g.10gb is exposed as nvidia.com/mig-1g.10gb, while the single strategy exposes it as a generic nvidia.com/gpu resource.
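
For reference, a minimal sketch of the manual workflow with nvidia-smi, run directly on the node (the available profile names depend on the GPU model; 1g.10gb is used here as an example):

# Enable MIG mode on GPU 0 (may require draining workloads and resetting the GPU)
sudo nvidia-smi -i 0 -mig 1

# Create two 1g.10gb GPU instances together with their compute instances
sudo nvidia-smi mig -i 0 -cgi 1g.10gb,1g.10gb -C

# List the MIG devices, then destroy compute and GPU instances when done
sudo nvidia-smi mig -lgi
sudo nvidia-smi mig -dci
sudo nvidia-smi mig -dgi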

Managing MIG devices manually with the nvidia-smi CLI or with NVML is rather impractical: in Kubernetes the NVIDIA GPU Operator offers an easier way to use MIG, though still with limitations. The operator uses a ConfigMap defining a set of allowed MIG configurations that you can apply to each node by tagging it with a label.

You can edit this ConfigMap to define your own custom MIG configurations, as in the example shown below. In this example, a node is labeled with nvidia.com/mig.config=all-1g.5gb. Therefore, the GPU Operator will partition each GPU of that node into seven 1g.5gb MIG devices, which are then exposed to Kubernetes as nvidia.com/mig-1g.5gb resources.

apiVersion: v1
kind: ConfigMap
metadata:
  name: default-mig-parted-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.5gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.5gb": 7
      all-2g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "2g.10gb": 3

To make efficient use of the resources in the cluster with the NVIDIA GPU Operator, the cluster admin would have to continuously modify the ConfigMap to adapt the MIG geometry to the ever-changing compute requirements of the workloads.

This is very impractical. Although this approach is certainly better than SSH-ing into nodes and manually creating/deleting MIG devices, it is labor-intensive and time-consuming for the cluster admin. As a result, the configuration of MIG devices is often changed rarely or not applied at all, and in both cases this results in large inefficiencies in GPU utilization and thus higher infrastructure costs.

This challenge can be overcome with Dynamic GPU Partitioning. Later in this article we will see how to dynamically partition a GPU with MPS using the open source module nos, following an approach that also works with MIG.

Multi-Process Service (MPS) is a client-server implementation of the CUDA Application Programming Interface (API) for running multiple processes concurrently on the same GPU.

The server manages GPU access, providing concurrency between clients. Clients connect to it through the client runtime, which is built into the CUDA Driver library and can be used transparently by any CUDA application.

MPS is compatible with basically every modern GPU and provides the highest flexibility, allowing you to create GPU slices with arbitrary limits on both the amount of allocatable memory and the available compute. However, it does not enforce full memory isolation between processes. In most cases, MPS represents a good compromise between MIG and time-slicing.

Compared to time-slicing, MPS eliminates the overhead of context switching by running processes in parallel through spatial sharing, and therefore leads to better compute performance. Moreover, MPS gives each process its own GPU memory address space. This makes it possible to enforce memory limits on individual processes, overcoming the limitations of time-slicing.
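
These compute and memory limits are exposed by the MPS control daemon itself. As a rough sketch of how they look outside Kubernetes on a bare host (the values and application name are illustrative, and the memory limit variable requires a recent CUDA version):

# Start the MPS control daemon for GPU 0
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d

# Limit each MPS client to ~50% of the SMs and 4 GB of device memory
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50
export CUDA_MPS_PINNED_DEVICE_MEM_LIMIT="0=4G"

# Any CUDA application launched with these variables now runs as an MPS client
./my_cuda_app   # placeholder for a real workload

# Stop the daemon when done
echo quit | nvidia-cuda-mps-control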

In MPS, however, client processes are not fully isolated from each other. Even though MPS lets you limit clients’ compute and memory resources, it does not provide error isolation or memory protection. This means that a client process can crash and cause the entire GPU to reset, impacting all other processes running on the GPU.

The NVIDIA Kubernetes Device Plugin does not support MPS partitioning, so using MPS in Kubernetes is not straightforward. In the following section, we explore an alternative way to take advantage of MPS for GPU sharing, leveraging nos and a different Kubernetes device plugin.

You can enable MPS partitioning in a Kubernetes cluster by installing this fork of the NVIDIA Device Plugin with Helm:

helm install oci://ghcr.io/nebuly-ai/helm-charts/nvidia-device-plugin \
  --version 0.13.0 \
  --generate-name \
  -n nebuly-nvidia \
  --create-namespace
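
After the installation, you can check that the plugin Pods are running in the namespace created above:

kubectl get pods -n nebuly-nvidia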

By default, the Helm chart deploys the device plugin with MPS mode enabled on all nodes labeled nos.nebuly.com/gpu-partitioning=mps. To enable MPS partitioning on the GPUs of a specific node, you simply need to apply the label nos.nebuly.com/gpu-partitioning=mps to it.

It is likely that a version of the NVIDIA Device Plugin is already installed on your cluster. If you don’t want to remove it, you can choose to install this forked plugin alongside the original NVIDIA Device Plugin and run it only on specific nodes. To do so, it is important to ensure that only one of the two plugins is running on a node at a time. As described in the installation guide, this can be achieved by editing the specification of the original NVIDIA Device Plugin and adding an anti-affinity rule in its spec.template.spec, so that it does not run on the same nodes targeted by the forked plugin:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nos.nebuly.com/gpu-partitioning
              operator: NotIn
              values:
                - mps
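
How you add this rule depends on how the original plugin was deployed. For instance, assuming it runs as a standalone DaemonSet named nvidia-device-plugin-daemonset in kube-system (the name and namespace may differ in your cluster, and with the GPU Operator you would edit the ClusterPolicy instead), you can edit it in place:

kubectl -n kube-system edit daemonset nvidia-device-plugin-daemonset
# then add the affinity block above under spec.template.spec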

After installing the device plugin, you can configure it to expose GPUs as multiple MPS resources by editing the sharing.mps section of its configuration. For example, the configuration below tells the plugin to expose the GPU with index 0 to Kubernetes as two GPU resources (named nvidia.com/gpu-4gb) with 4 GB of memory each:

version: v1
sharing:
  mps:
    resources:
      - name: nvidia.com/gpu
        rename: nvidia.com/gpu-4gb
        memoryGB: 4
        replicas: 2
        devices: ["0"]

The resource name advertised to Kubernetes, the partition size, and the number of replicas can be configured as needed. Going back to the example above, a container can request a 4 GB slice of the GPU as follows:

apiVersion: v1
kind: Pod
metadata:
  name: mps-partitioning-example
spec:
  hostIPC: true                  # required: share the host IPC namespace with the MPS server
  securityContext:
    runAsUser: 1000              # must match the user ID of the MPS server (1000 by default)
  containers:
    - name: sleepy
      image: "busybox:latest"
      command: ["sleep", "120"]
      resources:
        limits:
          nvidia.com/gpu-4gb: 1  # request one 4 GB MPS slice

Note that there are a few constraints for Pods with containers requesting MPS resources:

  1. Containers must run with the same user ID as the MPS server deployed with the device plugin, which is 1000 by default. You can change it by editing the mps.userID value of the Device Plugin installation chart (see the example after this list).
  2. The Pod specification must include hostIPC: true. Since MPS requires the clients and the server to share the same memory space, the Pods need access to the IPC namespace of the host node so that they can communicate with the MPS server running on it.
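
For example, a sketch of overriding the MPS server user ID at install time (the release name and the value 1001 are just illustrative; mps.userID is the chart value mentioned above):

helm upgrade --install nvidia-device-plugin-mps \
  oci://ghcr.io/nebuly-ai/helm-charts/nvidia-device-plugin \
  --version 0.13.0 \
  -n nebuly-nvidia \
  --create-namespace \
  --set mps.userID=1001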

In this example, the container can allocate only up to 4 GB of memory on the shared GPU, since it requests a single nvidia.com/gpu-4gb resource. If it tries to allocate more memory, it will crash with an Out-Of-Memory (OOM) error without affecting the other Pods.

However, it is important to point out that nvidia-smi accesses the NVIDIA drivers bypassing the MPS client runtime. As a result, running nvidia-smi inside the container will show the resources of the whole GPU in its output, not just those of the assigned slice.

Overall, managing MPS resources through the Device Plugin configuration is complex and time-consuming. It would be better to just create Pods requesting MPS resources and let something else provision and manage them automatically.

Dynamic MPS Partitioning does exactly that: it automates the creation and deletion of MPS resources based on the real-time requirements of the workloads in the cluster, ensuring that the optimal sharing configuration is always applied to the available GPUs.

To apply dynamic partitioning, we need to use nos, an open-source module to efficiently run GPU workloads on Kubernetes.

I have already covered how to use nos for dynamic GPU partitioning based on Multi-Instance GPU (MIG). We therefore won’t delve into the details here, as nos manages MPS partitioning in the same way. For more information, you can refer to the article Dynamic MIG Partitioning in Kubernetes, or check the nos documentation.

The only difference between MPS and MIG dynamic partitioning is the value of the label that tells nos which nodes it should manage GPU partitioning for. In the case of MPS, you need to label the nodes as follows:

kubectl label nodes <node-names> "nos.nebuly.com/gpu-partitioning=mps"
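
After that, workloads just request MPS slices and nos provisions them on demand. A minimal sketch, assuming nos exposes MPS slices with the same nvidia.com/gpu-<memory>gb resource naming used by the forked device plugin earlier (check the nos documentation for the exact naming):

apiVersion: v1
kind: Pod
metadata:
  name: nos-mps-example
spec:
  hostIPC: true                  # still required for MPS clients
  securityContext:
    runAsUser: 1000              # must match the MPS server user ID
  containers:
    - name: app
      image: "busybox:latest"
      command: ["sleep", "120"]
      resources:
        limits:
          nvidia.com/gpu-4gb: 1  # nos creates this MPS slice on demand (naming assumed)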

