ROSA with Nvidia GPU Workloads - Manual
This content is authored by Red Hat experts, but has not yet been tested on every supported configuration.
This is a guide to manually enable NVIDIA GPU workloads on a ROSA cluster, as an alternative to our Helm chart guide.
Prerequisites
- ROSA cluster (4.14+)
- rosa CLI
- oc CLI
1. Setting up GPU machine pools
In this tutorial, we use g5.4xlarge nodes for the GPU machine pool, with auto-scaling enabled up to 4 nodes. Create the machine pool with the command below, replacing your-cluster-name with the name of your cluster.
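A minimal sketch of the rosa command is shown below; the machine pool name gpu-pool is just an example and can be anything you like.

```bash
# Create an auto-scaling GPU machine pool (1-4 nodes) using g5.4xlarge instances.
rosa create machinepool \
  --cluster=your-cluster-name \
  --name=gpu-pool \
  --instance-type=g5.4xlarge \
  --enable-autoscaling \
  --min-replicas=1 \
  --max-replicas=4
```

You can check the result with `rosa list machinepools --cluster=your-cluster-name` and wait for the new nodes to join the cluster.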
Note that you can also use a different instance type or create the machine pool without auto-scaling.
2. Installing NFD operator
The Node Feature Discovery (NFD) operator discovers hardware features such as NVIDIA GPUs on your nodes, and the NFD instance labels those nodes so you can target them for workloads. Please refer to the official OpenShift documentation for more details. First, install the NFD operator, for example with the manifests below.
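As a sketch, the operator can be installed by creating the openshift-nfd namespace, an OperatorGroup, and a Subscription. The channel and operator names below follow the OpenShift documentation; verify them against your cluster's catalog (for example with `oc get packagemanifests nfd -n openshift-marketplace`).

```bash
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  targetNamespaces:
  - openshift-nfd
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: stable
  installPlanApproval: Automatic
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
```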
Note that the operator installation above might take a few minutes. Next, we will create the NFD instance.
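A minimal NodeFeatureDiscovery instance is sketched below. The operand image tag is an assumption and should match your OpenShift minor version; the operator also ships a default example you can apply as-is from the console.

```bash
cat <<EOF | oc apply -f -
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  instance: ""
  topologyupdater: false
  operand:
    image: registry.redhat.io/openshift4/ose-node-feature-discovery:v4.14
    imagePullPolicy: Always
EOF
```

Once the NFD workers are running, GPU nodes should be labeled with feature.node.kubernetes.io/pci-10de.present=true (10de is NVIDIA's PCI vendor ID).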
3. Installing GPU operator
Next, we will set up the NVIDIA GPU Operator, which manages the NVIDIA software components and the ClusterPolicy object to ensure NVIDIA GPUs are configured correctly in the OpenShift environment. Please refer to the official NVIDIA documentation for more details.
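A sketch of the operator installation, modeled on the NVIDIA documentation, is to look up the default channel for the certified operator and then create the namespace, OperatorGroup, and Subscription. The package name gpu-operator-certified and the certified-operators catalog source are the usual values, but confirm them against your cluster.

```bash
# Look up the default channel for the certified GPU operator
CHANNEL=$(oc get packagemanifest gpu-operator-certified -n openshift-marketplace \
  -o jsonpath='{.status.defaultChannel}')

cat <<EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: "${CHANNEL}"
  installPlanApproval: Automatic
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF
```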
And finally, let's create (or update) the ClusterPolicy.
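One way to do this, as a sketch, is to extract the operator's default ClusterPolicy example from its ClusterServiceVersion and apply it; this assumes jq is installed and that the GPU operator's CSV has finished installing.

```bash
# Find the GPU operator's CSV and extract its default ClusterPolicy example
CSV=$(oc get csv -n nvidia-gpu-operator -o name | grep gpu-operator-certified)
oc get "${CSV}" -n nvidia-gpu-operator \
  -o jsonpath='{.metadata.annotations.alm-examples}' | jq '.[0]' > clusterpolicy.json

# Create or update the ClusterPolicy
oc apply -n nvidia-gpu-operator -f clusterpolicy.json
```

The operator then rolls out the driver, container toolkit, device plugin, and monitoring components; the ClusterPolicy status should eventually report a state of ready.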
Validating GPU (optional)
By now your GPUs should be set up correctly. However, if you'd like to validate the setup, you can run the following in a terminal.
In essence, we verify that NFD has detected the GPUs, run nvidia-smi in a GPU driver daemonset pod, run a simple CUDA vector addition test pod, and then delete it.
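A sketch of such a validation script is shown below; the driver daemonset label (app=nvidia-driver-daemonset) and the CUDA sample image tag are assumptions that may need adjusting for your GPU operator version.

```bash
#!/bin/bash
set -euo pipefail

# 1. Verify that NFD detected the GPUs (10de is NVIDIA's PCI vendor ID)
GPU_NODES=$(oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true -o name)
if [ -z "${GPU_NODES}" ]; then
  echo "No GPU nodes detected"
  exit 1
fi
echo "GPU nodes: ${GPU_NODES}"

# 2. Run nvidia-smi inside one of the GPU driver daemonset pods
DRIVER_POD=$(oc get pods -n nvidia-gpu-operator -l app=nvidia-driver-daemonset -o name | head -n 1)
if ! oc exec -n nvidia-gpu-operator "${DRIVER_POD}" -- nvidia-smi; then
  echo "Failed to run nvidia-smi"
  exit 1
fi

# 3. Run a simple CUDA vector addition test pod
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
  namespace: default
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
oc wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-vectoradd -n default --timeout=300s
oc logs -n default cuda-vectoradd

# 4. Delete the test pod
oc delete pod cuda-vectoradd -n default
```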
Note that this validation step can take a few minutes to complete. If you see errors such as "No GPU nodes detected" or "Failed to run nvidia-smi", the driver and toolkit pods may still be initializing, so try again after a few minutes.