
Today, GPUs have become a ubiquitous type of hardware accelerator, often used for Artificial Intelligence and Machine Learning (AI/ML) applications. On an elastic cloud platform, such applications can benefit from the cost-saving tool of machine autoscaling, i.e. automatically adding compute power as the load increases and removing unutilized compute when the load drops. This type of autoscaling eliminates the need to pay upfront for hardware resources, or to pay for idle or underutilized ones.

A great infrastructure-agnostic, enterprise-grade way to run containerized applications, including AI/ML, on a variety of private and public clouds is Red Hat OpenShift Container Platform. It offers cluster autoscaling out of the box and supports NVIDIA GPUs via the NVIDIA GPU Operator. However, properly autoscaling clusters with GPU-enabled worker nodes can be challenging.

We will show how to configure OpenShift cluster autoscaling for NVIDIA GPUs, share useful tips and patterns, and point out a few pitfalls to avoid. We ran the examples in this blog post on AWS, as Amazon offers relatively cheap instances with an NVIDIA Tesla T4 GPU, e.g. g4dn.xlarge.

Autoscale NVIDIA GPUs Like a Pro

You will need the NVIDIA GPU Operator to run GPU-accelerated workloads on a Red Hat OpenShift cluster. This blog post is a follow-up and refresh of Sebastian Jug's Simplifying deployments of accelerated AI workloads on Red Hat OpenShift with NVIDIA GPU Operator. Things have evolved since then: the operator can now be deployed via Operator Lifecycle Manager (OLM), and it no longer requires an entitlement. The installation is a breeze with the simple steps provided in the NVIDIA GPU Operator on OpenShift documentation.

Now, read the Cluster Autoscaling in OpenShift 4.12 documentation to enable cluster autoscaling and make sure that your workloads always have a GPU to run on.
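
If you have not configured machine autoscaling yet, a MachineAutoscaler resource ties the replica bounds to a specific machine set. A minimal sketch follows; the machine set name and the replica range are placeholders for your own values.

apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-us-east-2a-gpu            # illustrative name
  namespace: openshift-machine-api
spec:
  minReplicas: 0
  maxReplicas: 4
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: worker-us-east-2a-gpu          # must match an existing MachineSet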

A request for GPU resources may come from a new application deployed to the same OpenShift cluster, or from an existing application that is asking for additional pods to handle all requests, e.g. using a custom metrics autoscaler on OpenShift. In either case, it is the cluster autoscaler that takes care of adding hardware if needed.
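
For illustration, such a request boils down to a pod asking for the nvidia.com/gpu extended resource. A minimal sketch of a GPU-consuming deployment might look as follows; the name and image are placeholders.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-inference                    # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-inference
  template:
    metadata:
      labels:
        app: gpu-inference
    spec:
      containers:
        - name: inference
          image: quay.io/example/inference:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: "1"        # one GPU (or GPU replica) per pod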

Keep in mind that because the GPU is an extended resource, it requires extra care when autoscaling:

  • It may take considerable time before a newly provisioned node can expose its GPU capacity, due to NVIDIA GPU driver installation and configuration.
  • The exposed GPU capacity may differ from the actual hardware count and is therefore not known beforehand, which can lead to overprovisioning.

Below you will find information that will help you deal with these challenges and get the most out of GPU autoscaling.

Take Advantage of NVIDIA GPU Sharing

In many cases a pod does not need an entire GPU all for itself and can make do with just a fraction of the hardware GPU power, leaving the rest for other pods. NVIDIA GPU sharing mechanisms at the system or hardware level, i.e. transparent to applications, are a great way to improve GPU utilization. For ML training jobs, you will usually need multiple full GPUs. If you run inference, on the other hand, and depending on the size of your model, you should use the NVIDIA GPU sharing capabilities to parallelize multiple jobs on a single GPU.

For instance, the NVIDIA multi-instance GPU (MIG) technology allows you to split the hardware resources of a GPU into multiple partitions. Each partition is seen as an independent GPU instance, with memory and fault isolation at the hardware layer between the instances. At the time of writing MIG is only supported on A30, A100, and H100 NVIDIA GPU accelerators.
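
With the GPU operator, the desired MIG layout is selected through the nvidia.com/mig.config node label. As a sketch, the label can also be set on the machine set so that autoscaled nodes come up with the right profile; the all-1g.5gb profile below assumes an A100 and is only an example.

kind: MachineSet
spec:
  template:
    spec:
      metadata:
        labels:
          # ask the MIG manager to split each GPU into 1g.5gb instances
          nvidia.com/mig.config: "all-1g.5gb"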

Time-slicing NVIDIA GPUs in OpenShift is another method of sharing access to a GPU. It does not provide memory or fault isolation between GPU replicas, but works on older generation GPUs that do not support MIG. Time-slicing is supported with all NVIDIA GPUs because it relies on CUDA compute preemption and a specific device plugin configuration.

apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config
  namespace: nvidia-gpu-operator
data:
  Tesla-T4-time-sliced: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
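
For the configuration to take effect, the device plugin of the GPU operator has to be pointed at this config map. A sketch of the relevant part of the ClusterPolicy resource is shown below; gpu-cluster-policy is the name used in the operator documentation, and the default entry is optional.

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  devicePlugin:
    config:
      # ConfigMap that holds the sharing configuration(s)
      name: device-plugin-config
      # optional: entry applied to nodes without an explicit
      # nvidia.com/device-plugin.config label
      default: Tesla-T4-time-sliced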

In both cases, the node will expose more GPU instances than its hardware actually has, affecting pod scheduling.

$ oc get node ip-10-0-134-110.us-east-2.compute.internal -o jsonpath='{.metadata.labels}' | jq | grep -e 'nvidia.com/gpu.count' -e 'nvidia.com/gpu.replicas'
"nvidia.com/gpu.count": "1",
"nvidia.com/gpu.replicas": "4",
$ oc get node ip-10-0-134-110.us-east-2.compute.internal -o jsonpath='{.status.capacity}' | jq | grep 'nvidia.com/gpu'
"nvidia.com/gpu": "4",


Time-slicing is cheaper to experiment with for GPU autoscaling. If you need MIG support, you can use EC2 P4 instances with A100 GPU on AWS, NC A100 v4-series on Azure, or the A2 accelerator-optimized machine type on GCP.

You may mix partitioned and non-partitioned GPUs, and GPU-enabled and non-GPU compute nodes in a single cluster, and assign pods to nodes depending on your needs. More on this later.

Avoid Overprovisioning

When you autoscale a cluster that runs the NVIDIA GPU Operator, you may encounter undesired overprovisioning, where the autoscaler creates more nodes than needed. There may be multiple reasons for this.

The GPU is an extended resource, as opposed to a core one such as CPU or memory. It therefore takes time for the NVIDIA GPU Operator to discover the GPU of a node, set it up, and eventually let the other subsystems know which type of GPU, and how many, the node has.

Until the GPU shows up and can be allocated to pods waiting for GPU resources:

  1. If you reserve the node solely for GPU-accelerated workloads, or there are no pending non-GPU workloads, the node may remain idle, potentially triggering a scale-down event and deleting the node if the scale-down timers are too aggressive.
  2. The autoscaler may decide that the node does not have the resources required to schedule the pending pods that require GPU, and attempt to provision another node.

Experiment with your autoscaler timers to avoid the first problem, for instance

delayAfterAdd: 20m
unneededTime: 5m


The way to deal with the second issue is by adding a cluster-api/accelerator label to your nodes, as explained in When using the Nvidia GPU Operator, more nodes than needed are created by the Node Autoscaler.

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
...
spec:
  ...
  template:
    ...
    spec:
      metadata:
        labels:
          cluster-api/accelerator: "<type>"

The type must match a GPU type in the ClusterAutoscaler resource.

resourceLimits:
  gpus:
    - type: <type>
      min: <min>
      max: <max>

Although the mere presence of the label will cause the autoscaler to wait until a node's GPU has been discovered, it is also a good way to distinguish between nodes with different GPU models or GPU-sharing methods. As you will see, this may come in handy when assigning pods to particular GPU types.

As we are using both time-sliced and non-shared GPUs in our cluster, our ClusterAutoscaler resource looks as follows:

apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  logVerbosity: 4
  maxNodeProvisionTime: 10m
  podPriorityThreshold: -10
  resourceLimits:
    gpus:
      - max: 2
        min: 0
        type: Tesla-T4
      - max: 16
        min: 0
        type: Tesla-T4-SHARED
    maxNodesTotal: 20
  scaleDown:
    delayAfterAdd: 20m
    delayAfterDelete: 10s
    delayAfterFailure: 5m
    enabled: true
    unneededTime: 5m
    utilizationThreshold: "0.5"

Watch Out When Scaling Up from Zero

You may autoscale existing GPU workers by creating a MachineAutoscaler that references an existing machine set, or add a completely new machine set (MachineSet resource). In the latter case, watch out for the overprovisioning issue described below.

Consider the following scenario:

  • You want to schedule a deployment with two replicas.
  • Each pod of the deployment requests one GPU.
  • Your machine set is configured for cloud instances that have one hardware GPU each.
  • But you are using GPU sharing so that the actual GPU capacity of a node is four.

The two-replica deployment is supposed to fit onto a single GPU node in this case. However, when a cluster starts with zero machines in an auto-scalable machine set, and since the GPU is an extended resource, the platform will not have knowledge of the actual GPU capacity of a new machine. As a result it will try to provision a node per requested GPU, leading to overprovisioning.

Notice that this is different from scaling up a machine set that already has at least one machine in it, or scaling from zero a machine set that contained machines in the past (i.e. has been scaled down). This is because the platform will have hints about the capacity of the machine set's members.

The way it works is that the autoscaler tries to derive a "template" from existing machines in a machine set, in particular the node capacity, and uses that template for future autoscaling decisions.

To deal with the problem, upstream Kubernetes allows setting hints for the autoscaler in these situations. OpenShift supports a subset of the manual hints, with more coming in future versions. On supported platforms, OpenShift will communicate the NVIDIA GPU count to the cluster autoscaler according to the machine type.

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    machine.openshift.io/GPU: "1"

However, partitioned GPUs are not supported yet. Right now, your best options with GPU sharing are either to preheat the autoscaler by pre-provisioning a node from a GPU-enabled auto-scalable machine set, or to gradually increase the replica count of a GPU-accelerated deployment when scaling from zero.

Another point to pay attention to when adding a new MachineSet is a mismatch between its scale definition and the MachineAutoscaler. If you create a MachineSet resource with replicas: 1 and there are no workload pods to schedule, but the MachineAutoscaler has minReplicas: 0, a node will be added and then deleted (scaled down) by the autoscaler. You probably want to avoid that.
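
In other words, when you intend to scale from zero, keep the two resources in sync. A sketch with placeholder names:

kind: MachineSet
metadata:
  name: worker-gpu
spec:
  replicas: 0                  # start with no machines ...
---
kind: MachineAutoscaler
metadata:
  name: worker-gpu
  namespace: openshift-machine-api
spec:
  minReplicas: 0               # ... and let the autoscaler own the range
  maxReplicas: 4
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: worker-gpu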

Control Where Your Pods Run

Let us consider the following use case. A customer needs to run ML training that requires a powerful GPU, is performed once a day and can wait until a suitable GPU node becomes available. On the other hand, inference requests keep flowing in all the time, must be processed quickly, but can run on less expensive or partitioned GPUs.

It is possible to have multiple MachineAutoscaler (and MachineSet) resources in a cluster, and use nodeSelector to autoscale depending on the requested workload so that the workload is assigned to the optimal GPU profile. We have successfully tested a cluster with two machine sets as follows.

One contains machines with a full GPU for our ML training jobs

kind: MachineSet
spec:
  template:
    spec:
      metadata:
        labels:
          cluster-api/accelerator: "Tesla-T4"

with the corresponding workloads defined as

kind: Deployment
spec:
  template:
    spec:
      nodeSelector:
        cluster-api/accelerator: "Tesla-T4"

The other is configured for time-sliced GPUs (here Tesla-T4-time-sliced refers to an entry in the NVIDIA GPU operator's device plugin configuration).

kind: MachineSet
spec:
  template:
    spec:
      metadata:
        labels:
          nvidia.com/device-plugin.config: "Tesla-T4-time-sliced"
          cluster-api/accelerator: "Tesla-T4-SHARED"

A workload in this case will have

kind: Deployment
spec:
  template:
    spec:
      nodeSelector:
        cluster-api/accelerator: "Tesla-T4-SHARED"

Using this method, an OpenShift user can also reserve their valuable GPU resources for workloads that require GPUs, and schedule non-GPU workloads on cheaper nodes, while also preserving the autoscaling functionality. Keep in mind that a custom label will be needed to create affinity for non-GPU nodes.
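
For example, a custom label of your own choosing can be put on the CPU-only machine set and referenced by non-GPU deployments. The workload-type label below is our own convention, not something OpenShift defines.

# on the CPU-only machine set
kind: MachineSet
spec:
  template:
    spec:
      metadata:
        labels:
          workload-type: "cpu-only"      # custom label, any name works
---
# and on the non-GPU workloads
kind: Deployment
spec:
  template:
    spec:
      nodeSelector:
        workload-type: "cpu-only"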

Overprovision to Have a GPU at the Ready

Let us look again at the ML example in the previous section. Remember that we want our inference requests to be processed as quickly as possible. This is not a problem when there are available nodes with free GPU capacity, but can be a challenge when the cluster must autoscale — due to slow GPU provisioning.

The solution is to make the autoscaler add a standby node every time the existing GPU capacity of the cluster is close to saturation. This can be achieved with intentional overprovisioning, using low-priority pause pods that will be evicted as soon as a "real" workload needs the resources.

In the case of a shared GPU, the behavior can be fine-tuned to control how much of a node may be utilized before the creation of a standby node is triggered. For example, if a node has four GPU partitions and the overprovisioning deployment is configured with one replica, autoscaling will be triggered only once your workloads have consumed the entire node (four GPUs), because only the fourth workload pod has to evict the pause pod. Alternatively, an overprovisioning deployment with three replicas will trigger autoscaling as soon as the node's GPU consumption reaches 50% (two GPUs), since the second workload pod already has to displace a pause pod.

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: Priority class used by overprovisioning
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tesla-t4-shared-overprovisioning
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      run: tesla-t4-shared-overprovisioning
  template:
    metadata:
      labels:
        run: tesla-t4-shared-overprovisioning
    spec:
      priorityClassName: overprovisioning
      terminationGracePeriodSeconds: 0
      nodeSelector:
        cluster-api/accelerator: "Tesla-T4-SHARED"
      containers:
        - name: reserve-resources
          image: registry.k8s.io/pause:3.9
          securityContext:
            seccompProfile:
              type: RuntimeDefault
            capabilities:
              drop:
                - ALL
            runAsNonRoot: true
            allowPrivilegeEscalation: false
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"

Notice the nodeSelector we are using to target a particular auto-scalable machine set.

In this example we specified how many GPUs the deployment requests. However, the non-GPU resources of a node (CPU, memory) may become exhausted even before the node runs out of GPU capacity. In that case it would be wrong to trigger autoscaling based on GPU utilization alone, and you will need to take other resources into account as well.
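
As a sketch, the reserve-resources container above could also request a CPU and memory share roughly matching what one of your workload pods consumes, so that a standby node is provisioned whichever resource runs out first. The values below are placeholders to be tuned for your workloads.

resources:
  requests:
    nvidia.com/gpu: "1"
    cpu: 500m                  # placeholder: about one workload pod's CPU share
    memory: 2Gi                # placeholder: about one workload pod's memory share
  limits:
    nvidia.com/gpu: "1"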

Also keep in mind that a cluster cannot scale beyond the limit set in its autoscaling configuration.

If you want to read more on proactive autoscaling using pause pods, the blog post How Full Is My Cluster, Part 6: Proactive Node Autoscaling offers an excellent explanation.

Scale Down to Save Money

The autoscaler can not only provision nodes when more compute power is needed, but also delete underutilized nodes to save money. In order to benefit from automatic scale down, it must be enabled in the ClusterAutoscaler resource:

scaleDown:
  enabled: true

Take time to understand the configuration. As we already mentioned, overly aggressive scale-down timers may render the entire cluster unstable. You should also consider other factors when setting your scale-down policies. For example, if the platform pricing is by the hour, you may want to keep a node for that time period even if it is underutilized (delayAfterAdd). This way, if your application needs to scale up again, it will reuse the node you have already paid for. Of course, this only works for demand spikes that happen within the hour; keeping underutilized nodes around for long-running workloads may prove more expensive. With GPU-enabled nodes, remember to take into account the additional set-up time: node feature discovery (NFD), labeling, GPU driver installation and configuration all count towards the total lease time of a cloud machine.

Experiment with your platform and workloads to get optimal values.

Mind Resource Fragmentation

As we already mentioned, GPU sharing may make autoscaling clusters with GPU workers even more challenging. In addition, it may contribute to resource fragmentation, one of the undesired effects of which is that underutilized nodes cannot be scaled down because there are pods still running on them.

Consider the following scenario.

  1. GPU-accelerated workloads — jobs or deployments — are set to run on a cluster. The workloads together make up eight pods, where each pod requests one GPU slice (partition).
  2. The cluster scales up to satisfy the requests and ends up with two GPU-enabled workers, each divided into four GPU partitions. When the pods are eventually scheduled, they occupy all 8 GPU partitions of the cluster.
  3. After a while, some jobs finish, or deployments scale down due to reduction in the load, which causes four of the pods to be terminated.
  4. This may leave each node running two GPU-accelerated pods, i.e. at 50% utilization. At least from the perspective of requested GPUs, all remaining workloads could fit on a single node and let the other one be scaled down (remember though that the allocation of other resources like CPU and memory will also affect the fragmentation).

Another problem here is that a workload requesting more than two GPUs cannot be scheduled without adding another node, even though four GPU partitions are free in the cluster overall.

Resource fragmentation can be mitigated by relocating pods between nodes to rebalance the cluster, e.g. using node draining. However, this means that the pods will be restarted.

Ideally, we need automated ways to deal with resource fragmentation. The HighNodeUtilization descheduling strategy in upstream Kubernetes does exactly this. It is currently not available in OpenShift, but will be added in the future.

Be careful though, as it often does not make practical sense to terminate a GPU-accelerated workload in order to relocate it to another node, unless the cost savings can justify it. Alternatively, a workload can be made to handle disruptions by either being stateless, or by being able to save intermediate results when stopped and resume the processing when restarted.

Also, you should not remove a node if this will affect the high availability of an application.
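
One way to express such a constraint is a PodDisruptionBudget, which the autoscaler respects when deciding whether a node can be drained and removed. A minimal sketch, assuming the inference pods carry an app: gpu-inference label:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gpu-inference-pdb      # illustrative name
spec:
  minAvailable: 1              # always keep at least one replica running
  selector:
    matchLabels:
      app: gpu-inference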

Scheduling that focuses on maximizing GPU utilization can also help the autoscaling and cost management when an incoming workload lands on an already present node. Read Packing workloads on AI supercomputer in the cloud to learn how a team at IBM has been optimizing resources and minimizing fragmentation with GPU-accelerated AI workloads at the scheduling phase.

Conclusion

Red Hat OpenShift proves to be an excellent platform for managing and autoscaling NVIDIA GPUs for containerized AI/ML workloads. Its robust features, such as cluster autoscaling and support for the NVIDIA GPU Operator, provide a scalable and enterprise-grade platform for efficiently utilizing GPU resources.

When autoscaling GPU-enabled worker nodes, leveraging NVIDIA GPU sharing mechanisms, such as multi-instance GPU (MIG) technology or time-slicing, is crucial for maximizing GPU utilization and parallelization. This ensures optimal resource allocation and cost efficiency.

Cluster autoscaling is a powerful tool that can help you optimize your GPU-enabled clusters. However, scaling GPUs has its challenges, in particular because the GPU is an extended Kubernetes resource. Hopefully, our best practices will help you overcome those challenges and avoid common mistakes when autoscaling NVIDIA GPUs on Red Hat OpenShift.


About the authors

Michael McCune is a software developer creating open source infrastructure and applications for cloud platforms. He has a passion for problem solving and team building, and a lifelong love of music, food, and culture.
