ROSA with NVIDIA GPU workloads and OpenShift AI
This content is authored by Red Hat experts, but has not yet been tested on every supported configuration. This guide has been validated on OpenShift 4.20. Operator CRD names, API versions, and console paths may differ on other versions.
This guide shows how to add NVIDIA GPU capacity to an existing Red Hat OpenShift Service on AWS (ROSA) cluster and validate it for use with Red Hat OpenShift AI.
The flow in this guide covers:
- creating a GPU machine pool on ROSA
- installing Node Feature Discovery (NFD)
- installing the NVIDIA GPU Operator
- creating a ClusterPolicy
- verifying that the GPU is exposed to the cluster
- enabling OpenShift AI hardware profiles
- creating a GPU-backed hardware profile and validating a GPU-enabled workbench
This guide was validated on ROSA 4.20 with OpenShift AI 2025.2 using an NVIDIA Tesla T4 GPU on an AWS g4dn.xlarge instance.
0. Prerequisites
Before you begin, make sure you have:
- an existing ROSA cluster with cluster-admin access
- the rosa CLI configured for your cluster
- the oc CLI configured and logged in
- sufficient AWS quota and capacity for a GPU instance type in your target Region and Availability Zone
- Red Hat OpenShift AI already installed, if you want to validate GPU-backed workbenches from the dashboard. You can follow Steps 1-2 of this article to install the RHOAI Operator.
During validation, a 2-worker m5.xlarge machine pool did not provide enough schedulable capacity for this walkthrough. Some OpenShift AI components could not be scheduled, and the OpenShift AI dashboard remained in a Not Ready state. Use at least 3 worker nodes, enable autoscaling, or create a dedicated machine pool for OpenShift AI if the existing workers are already heavily used.
This walkthrough was validated on an existing ROSA cluster in ca-central-1 using a g4dn.xlarge GPU machine pool.
1. Create a GPU machine pool
Start by creating a dedicated GPU machine pool instead of modifying existing worker pools. This keeps GPU workloads isolated and makes scheduling easier to reason about down the road.
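For example, a single-replica GPU pool can be created with the rosa CLI. The pool name and the taint below are illustrative assumptions; the taint matches the toleration used later in the hardware profile, so omit it if you prefer untainted GPU nodes:

```bash
# Create a dedicated single-node GPU machine pool
# (replace <cluster-name> with your ROSA cluster name)
rosa create machinepool --cluster=<cluster-name> \
  --name=gpu-pool \
  --replicas=1 \
  --instance-type=g4dn.xlarge \
  --taints="nvidia.com/gpu=true:NoSchedule"
```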
The GPU machine pool can take several minutes to provision. Wait until the new node joins the cluster and the machine pool shows 1/1.
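You can watch progress from the CLI:

```bash
# The machine pool should eventually report 1/1 replicas
rosa list machinepools --cluster=<cluster-name>

# Confirm the GPU node has joined the cluster
oc get nodes -l node.kubernetes.io/instance-type=g4dn.xlarge
```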
At this stage, the GPU node existed but did not yet advertise nvidia.com/gpu, because the GPU software stack had not yet been installed.
2. Install the Node Feature Discovery Operator
Install the Node Feature Discovery (NFD) Operator. NFD is used to discover hardware capabilities and label nodes appropriately.
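A CLI-based install is sketched below, assuming the Red Hat NFD Operator package name nfd on the stable channel; you can install the same operator from the console catalog instead:

```bash
oc apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  targetNamespaces:
  - openshift-nfd
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: stable
  installPlanApproval: Automatic
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
```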
Wait for the operator to install:
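For example, by checking the CSV in the openshift-nfd namespace:

```bash
# The NFD CSV should reach the Succeeded phase
oc get csv -n openshift-nfd
```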
Create the NodeFeatureDiscovery instance:
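A minimal sketch is shown below; it assumes the operator defaults are acceptable (creating the instance from the console pre-populates a more complete spec):

```bash
oc apply -f - <<EOF
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec: {}
EOF
```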
Verify the pods:
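```bash
oc get pods -n openshift-nfd
```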
At this point, all NFD components should be in Running state.
3. Install the NVIDIA GPU Operator
After NFD is installed, install the NVIDIA GPU Operator.
Option A: Install the certified operator from Software Catalog
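The operator can be installed from the console's Software Catalog; if you prefer the CLI, a sketch of the equivalent Namespace, OperatorGroup, and Subscription follows (the channel varies by operator version, so check the package's default channel first):

```bash
# Find the current default channel for the certified operator
oc get packagemanifests gpu-operator-certified -n openshift-marketplace \
  -o jsonpath='{.status.defaultChannel}{"\n"}'

oc apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: stable   # replace with the default channel reported above
  installPlanApproval: Automatic
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF
```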
Wait for the CSV:
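```bash
oc get csv -n nvidia-gpu-operator
```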
In this validation, the installed CSV was gpu-operator-certified.v26.3.0.
Option B: Install the NVIDIA GPU Operator with Helm
As an alternative, you can install the NVIDIA GPU Operator directly from NVIDIA’s maintained Helm chart.
Because Node Feature Discovery is already installed separately on OpenShift, disable the chart-managed NFD deployment during the Helm install.
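A sketch of the Helm flow, assuming the release name gpu-operator and the nvidia-gpu-operator namespace (nfd.enabled=false keeps the chart from deploying its own NFD):

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace nvidia-gpu-operator \
  --create-namespace \
  --set nfd.enabled=false
```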
In this validation, the installed chart version was v26.3.0.
4. Create the ClusterPolicy
If you installed the NVIDIA GPU Operator with Helm, you can skip this step because the chart already creates the ClusterPolicy.
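For the operator-catalog install, one common approach is to create the ClusterPolicy from the example embedded in the operator's CSV. The sketch below assumes jq is available and that the ClusterPolicy is the first entry in the CSV's alm-examples annotation:

```bash
# Extract the default ClusterPolicy example from the installed CSV and apply it
CSV=$(oc get csv -n nvidia-gpu-operator -o name | grep gpu-operator-certified)
oc get "$CSV" -n nvidia-gpu-operator \
  -o jsonpath='{.metadata.annotations.alm-examples}' | jq '.[0]' > clusterpolicy.json
oc apply -f clusterpolicy.json
```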
Verify readiness:
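```bash
oc get clusterpolicies.nvidia.com gpu-cluster-policy \
  -o jsonpath='{.status.state}{"\n"}'
```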
The gpu-cluster-policy should reach State: ready.
Note that it may take 15-20 minutes for the ClusterPolicy to become ready while the NVIDIA driver components are deployed and initialized on the GPU node.
5. Verify GPU capacity on the node
Once the ClusterPolicy is ready, verify that the GPU node exposes allocatable GPU resources.
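A quick way to check is to look at the allocatable resources on the node labeled by GPU feature discovery:

```bash
oc describe node -l nvidia.com/gpu.present=true | grep -A 10 "Allocatable:"
```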
An example of expected output:
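Trimmed, illustrative output; the important line is the nvidia.com/gpu resource:

```
Allocatable:
  cpu:             ...
  memory:          ...
  nvidia.com/gpu:  1
```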
The GPU node reported nvidia.com/gpu: "1", which means that the NVIDIA stack was working at the cluster level.
6. Validate the GPU with a simple pod
Before moving to OpenShift AI, validate the GPU with a simple test pod.
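A minimal sketch of such a pod is shown below. The pod name, image tag, and toleration are illustrative assumptions; the toleration matches the nvidia.com/gpu=true:NoSchedule taint used on the GPU machine pool:

```bash
oc apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: nvidia-smi
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9   # any CUDA base image with nvidia-smi works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```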
Watch it and inspect the logs:
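Using the pod name from the sketch above:

```bash
oc get pod gpu-smoke-test -w
oc logs gpu-smoke-test
```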
The nvidia-smi output should show the GPU in use (in this example, a Tesla T4) and confirm that the driver and CUDA stack are functioning correctly.
7. Enable OpenShift AI hardware profiles
To expose the newer hardware-profile workflow in the OpenShift AI dashboard, you need to enable hardware profiles in the OdhDashboardConfig custom resource.
This enables Settings -> Hardware profiles in the dashboard:
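A sketch of the change, assuming the default odh-dashboard-config resource in redhat-ods-applications and the disableHardwareProfiles feature flag; the flag name can differ between OpenShift AI versions, so inspect your OdhDashboardConfig spec first:

```bash
oc patch odhdashboardconfig odh-dashboard-config \
  -n redhat-ods-applications \
  --type merge \
  -p '{"spec":{"dashboardConfig":{"disableHardwareProfiles":false}}}'
```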
Verify that the change took effect:
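For example, assuming the same flag name:

```bash
oc get odhdashboardconfig odh-dashboard-config \
  -n redhat-ods-applications \
  -o jsonpath='{.spec.dashboardConfig.disableHardwareProfiles}{"\n"}'
```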
Wait for a few minutes and refresh the dashboard page. The dashboard should now display Hardware profiles under Settings.
8. Create a GPU hardware profile in OpenShift AI
In the OpenShift AI dashboard, go to Settings -> Hardware profiles and click Create hardware profile.
These are the hardware profile settings validated in this guide:
- Name: t4-gpu
- Visibility: Visible everywhere
- Additional resource:
  - Resource name: nvidia-gpu
  - Resource identifier: nvidia.com/gpu
  - Resource type: Other
  - Default: 1
  - Minimum allowed: 1
  - Maximum allowed: 1
- Node selector:
  - Key: nvidia.com/gpu.present
  - Value: true
- Toleration:
  - Key: nvidia.com/gpu
  - Operator: Equal
  - Value: true
  - Effect: NoSchedule
Once created, the hardware profile appears in the list under Settings -> Hardware profiles.
9. Create and validate a GPU-backed workbench
After you create the GPU hardware profile, create a data science project and then a workbench using the hardware profile you just created.
Wait until the workbench status is Running.
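From the CLI, you can watch the workbench pod in the project's namespace (my-gpu-project is a placeholder; use your data science project name):

```bash
oc get pods -n my-gpu-project -w
```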
To verify where the workbench landed and what resources it requested, inspect the pod:
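For example, with placeholder project and pod names:

```bash
# Show which node the workbench pod was scheduled on
oc get pods -n my-gpu-project -o wide

# Inspect the scheduling-related fields of the workbench pod
oc get pod <workbench-pod-name> -n my-gpu-project -o yaml | \
  grep -E -A 3 'nodeSelector|tolerations|nvidia.com/gpu'
```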
At this stage, the workbench pod should:
- run on the GPU node
- request nvidia.com/gpu: "1"
- use the nodeSelector nvidia.com/gpu.present: "true"
- include a toleration for nvidia.com/gpu=true:NoSchedule
Finally, click the workbench, launch a terminal, and confirm that the GPU is available:
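```bash
nvidia-smi
```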
In this validation, nvidia-smi inside the workbench showed an NVIDIA Tesla T4, confirming that the workbench had end-to-end GPU access through OpenShift AI.
10. Cleanup
If you no longer need the GPU test resources, remove them after validation.
Delete the standalone validation pod:
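Assuming the pod name from the smoke-test sketch above:

```bash
oc delete pod gpu-smoke-test
```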
Delete the OpenShift AI workbench from the dashboard, or remove the workbench pod and project resources from the CLI as needed:
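Workbenches are Notebook custom resources in the project's namespace; the names below are placeholders:

```bash
oc delete notebook <workbench-name> -n my-gpu-project
```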
If you created a dedicated OpenShift AI project only for this test, you can remove it:
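```bash
oc delete project my-gpu-project
```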
Delete the ClusterPolicy:
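```bash
oc delete clusterpolicy gpu-cluster-policy
```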
Delete the GPU Operator resources:
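The exact cleanup depends on how you installed the operator; a sketch covering both options:

```bash
# If installed from the catalog (OLM)
oc delete subscription gpu-operator-certified -n nvidia-gpu-operator
oc delete $(oc get csv -n nvidia-gpu-operator -o name | grep gpu-operator-certified) -n nvidia-gpu-operator
oc delete namespace nvidia-gpu-operator

# If installed with Helm
# helm uninstall gpu-operator -n nvidia-gpu-operator
```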
Delete the NFD resources:
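Assuming the resource names used in the install sketches above:

```bash
oc delete nodefeaturediscovery nfd-instance -n openshift-nfd
oc delete subscription nfd -n openshift-nfd
oc delete namespace openshift-nfd
```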
If you no longer need GPU worker capacity on the cluster, delete the GPU machine pool:
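Assuming the machine pool name used earlier:

```bash
rosa delete machinepool --cluster=<cluster-name> gpu-pool
```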
Verify that the GPU node has been removed:
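```bash
oc get nodes
```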
If you enabled hardware profiles only for this validation and do not want to leave them exposed in the dashboard, you can revert the dashboard setting:
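Again assuming the disableHardwareProfiles flag:

```bash
oc patch odhdashboardconfig odh-dashboard-config \
  -n redhat-ods-applications \
  --type merge \
  -p '{"spec":{"dashboardConfig":{"disableHardwareProfiles":true}}}'
```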
If you created a dedicated GPU hardware profile in the OpenShift AI dashboard, remove it from Settings -> Hardware profiles when it is no longer needed.