Cloud Experts Documentation

ROSA with NVIDIA GPU workloads and OpenShift AI

This content is authored by Red Hat experts, but has not yet been tested on every supported configuration. This guide has been validated on OpenShift 4.20. Operator CRD names, API versions, and console paths may differ on other versions.

This guide shows how to add NVIDIA GPU capacity to an existing Red Hat OpenShift Service on AWS (ROSA) cluster and validate it for use with Red Hat OpenShift AI.

The flow in this guide covers:

  • creating a GPU machine pool on ROSA
  • installing Node Feature Discovery (NFD)
  • installing the NVIDIA GPU Operator
  • creating a ClusterPolicy
  • verifying that the GPU is exposed to the cluster
  • enabling OpenShift AI hardware profiles
  • creating a GPU-backed hardware profile and validating a GPU-enabled workbench

This guide was validated on ROSA 4.20 with OpenShift AI 2025.2 using an NVIDIA Tesla T4 GPU on an AWS g4dn.xlarge instance.

0. Prerequisites

Before you begin, make sure you have:

  • an existing ROSA cluster with cluster-admin access
  • the rosa CLI configured for your cluster
  • the oc CLI configured and logged in
  • sufficient AWS quota and capacity for a GPU instance type in your target Region and Availability Zone
  • Red Hat OpenShift AI already installed if you want to validate GPU-backed workbenches from the dashboard. You can follow Steps 1-2 from this article to install the RHOAI operator.

During validation, a 2-worker m5.xlarge machine pool did not provide enough schedulable capacity for this walkthrough. Some OpenShift AI components could not be scheduled, and the OpenShift AI dashboard remained in a Not Ready state. Use at least 3 worker nodes, enable autoscaling, or create a dedicated machine pool for OpenShift AI if the existing workers are already heavily used.

This walkthrough was validated on an existing ROSA cluster in ca-central-1 using a g4dn.xlarge GPU machine pool.

1. Create a GPU machine pool

Start by creating a dedicated GPU machine pool instead of modifying existing worker pools. This keeps GPU workloads isolated and makes scheduling easier to reason about down the road.
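A sketch of the command, assuming the g4dn.xlarge instance type used in this validation. The cluster name, pool name, and taint are illustrative; the taint shown matches the toleration configured for the hardware profile later in this guide.

```shell
# Create a dedicated single-node GPU machine pool.
# <cluster-name> and the pool name "gpu-pool" are placeholders.
# The taint keeps non-GPU workloads off the GPU node.
rosa create machinepool \
  --cluster <cluster-name> \
  --name gpu-pool \
  --instance-type g4dn.xlarge \
  --replicas 1 \
  --taints "nvidia.com/gpu=true:NoSchedule"
```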

The GPU machine pool can take several minutes to provision. Wait until the new node joins the cluster and the machine pool shows 1/1.
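You can watch progress from the rosa CLI and confirm the node joined the cluster (the instance-type label is the standard Kubernetes node label):

```shell
# Watch until REPLICAS shows 1/1 for the GPU pool
rosa list machinepools --cluster <cluster-name>

# Confirm the GPU node has joined the cluster
oc get nodes -l node.kubernetes.io/instance-type=g4dn.xlarge
```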

At this stage, the GPU node existed but did not yet advertise nvidia.com/gpu, because the GPU software stack had not yet been installed.

2. Install the Node Feature Discovery Operator

Install the Node Feature Discovery (NFD) Operator. NFD is used to discover hardware capabilities and label nodes appropriately.

In this guide, the operators are installed with the CLI for repeatability and easy copy/paste. You can also install the same operators from Software Catalog (formerly known as OperatorHub) in the OpenShift web console if you prefer clicking through a UI.
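A typical CLI installation creates the openshift-nfd namespace, an OperatorGroup, and a Subscription. The channel and catalog names below are the usual defaults; verify them against your cluster's catalog before applying.

```shell
# Namespace, OperatorGroup, and Subscription for the NFD Operator
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  targetNamespaces:
  - openshift-nfd
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: stable
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
```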

Wait for the operator to install:
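For example:

```shell
# The NFD CSV should eventually report PHASE: Succeeded
oc get csv -n openshift-nfd -w
```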

Create the NodeFeatureDiscovery instance:
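A minimal instance is sketched below; it relies on the operator filling in default operand settings, which recent NFD Operator versions do. On some versions you may need to specify the operand image explicitly.

```shell
# Minimal NodeFeatureDiscovery custom resource; defaults come from the operator
cat <<EOF | oc apply -f -
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec: {}
EOF
```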

Verify the pods:
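For example:

```shell
# Expect nfd-controller-manager, nfd-master, and nfd-worker pods
oc get pods -n openshift-nfd
```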

At this point, all NFD components should be in Running state.

3. Install the NVIDIA GPU Operator

After NFD is installed, install the NVIDIA GPU Operator.

As in the previous step, the operator is installed with the CLI for repeatability; you can also install it from the Software Catalog in the OpenShift web console if you prefer a UI-based workflow.

Option A: Install the certified operator from Software Catalog
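The CLI equivalent subscribes to the certified-operators catalog. The sketch below looks up the package's default channel rather than hard-coding one; resource names are the conventional ones from NVIDIA's OpenShift documentation.

```shell
# Find the default channel for the certified GPU operator package
CHANNEL=$(oc get packagemanifest gpu-operator-certified -n openshift-marketplace \
  -o jsonpath='{.status.defaultChannel}')

# Namespace, OperatorGroup, and Subscription for the GPU Operator
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: ${CHANNEL}
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF
```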

Wait for the CSV:
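For example:

```shell
# Wait until the gpu-operator-certified CSV reports PHASE: Succeeded
oc get csv -n nvidia-gpu-operator -w
```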

In this validation, the installed CSV was gpu-operator-certified.v26.3.0.

Option B: Install the NVIDIA GPU Operator with Helm

As an alternative, you can install the NVIDIA GPU Operator directly from NVIDIA’s maintained Helm chart.

Because Node Feature Discovery is already installed separately on OpenShift, disable the chart-managed NFD deployment during the Helm install.
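A sketch of the Helm install; the release name gpu-operator is illustrative, and `nfd.enabled=false` is the chart value that disables the chart-managed NFD deployment.

```shell
# Add NVIDIA's Helm repository and install the GPU Operator,
# disabling the chart's own NFD because the NFD Operator is already installed
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace nvidia-gpu-operator \
  --create-namespace \
  --set nfd.enabled=false
```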

In this validation, the installed chart version was v26.3.0.

To see newer chart versions, run `helm search repo nvidia/gpu-operator --versions` before installing.

4. Create the ClusterPolicy

If you installed the NVIDIA GPU Operator with Helm, you can skip this step because the chart already creates the ClusterPolicy.
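If you installed the certified operator from the Software Catalog, one common approach is to extract the default ClusterPolicy from the CSV's alm-examples annotation and apply it (this assumes jq is available):

```shell
# Extract the default ClusterPolicy shipped with the installed CSV and apply it
CSV=$(oc get csv -n nvidia-gpu-operator -o name | grep gpu-operator-certified)
oc get "$CSV" -n nvidia-gpu-operator \
  -o jsonpath='{.metadata.annotations.alm-examples}' \
  | jq '.[0]' | oc apply -f -
```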

Verify readiness:
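For example:

```shell
# The state should eventually become "ready"
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'
```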

The gpu-cluster-policy should reach State: ready.

Note that it may take 15-20 minutes for the ClusterPolicy to become ready while the NVIDIA driver components are deployed and initialized on the GPU node.

5. Verify GPU capacity on the node

Once the ClusterPolicy is ready, verify that the GPU node exposes allocatable GPU resources.
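NFD labels GPU nodes with nvidia.com/gpu.present=true, so you can select on that label and print the allocatable GPU count:

```shell
# Print the allocatable nvidia.com/gpu count on each GPU node
oc get nodes -l nvidia.com/gpu.present=true \
  -o jsonpath='{range .items[*]}{.metadata.name}{": nvidia.com/gpu="}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```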

An example of expected output:

The GPU node should report nvidia.com/gpu: "1", confirming that the NVIDIA stack is working at the cluster level.

6. Validate the GPU with a simple pod

Before moving to OpenShift AI, validate the GPU with a simple test pod.
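A minimal test pod is sketched below. The pod name and image tag are illustrative; any CUDA base image that ships nvidia-smi works, and the toleration matches a GPU node tainted nvidia.com/gpu=true:NoSchedule.

```shell
# One-shot pod that requests a GPU and runs nvidia-smi
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi8
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```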

Watch it and inspect the logs:
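Assuming the test pod is named gpu-test:

```shell
oc get pod gpu-test -w   # wait for the pod to reach Completed
oc logs gpu-test         # should print the nvidia-smi table
```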

The nvidia-smi output should show the GPU in use (in this example, a Tesla T4) and confirm that the driver and CUDA stack are functioning correctly.

7. Enable OpenShift AI hardware profiles

To expose the newer hardware-profile workflow in the OpenShift AI dashboard, you need to enable hardware profiles in the OdhDashboardConfig custom resource.

This enables Settings -> Hardware profiles in the dashboard:
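For example, assuming the default OdhDashboardConfig name and namespace (field names can differ between OpenShift AI releases, so check your CR first):

```shell
# Enable hardware profiles in the OpenShift AI dashboard
oc patch odhdashboardconfig odh-dashboard-config \
  -n redhat-ods-applications \
  --type merge \
  -p '{"spec":{"dashboardConfig":{"disableHardwareProfiles":false}}}'
```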

Verify that the change took effect:
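For example:

```shell
# Should print: false
oc get odhdashboardconfig odh-dashboard-config \
  -n redhat-ods-applications \
  -o jsonpath='{.spec.dashboardConfig.disableHardwareProfiles}{"\n"}'
```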

Wait for a few minutes and refresh the dashboard page. The dashboard should now display Hardware profiles under Settings.

8. Create a GPU hardware profile in OpenShift AI

Click Create hardware profile.

These are the hardware profile settings validated in this guide:

  • Name: t4-gpu

  • Visibility: Visible everywhere

  • Additional resource:

    • Resource name: nvidia-gpu
    • Resource identifier: nvidia.com/gpu
    • Resource type: Other
    • Default: 1
    • Minimum allowed: 1
    • Maximum allowed: 1
  • Node selector:

    • Key: nvidia.com/gpu.present
    • Value: true
  • Toleration:

    • Key: nvidia.com/gpu
    • Operator: Equal
    • Value: true
    • Effect: NoSchedule

Once created, the t4-gpu hardware profile appears in the list under Settings -> Hardware profiles.

9. Create and validate a GPU-backed workbench

After you create the GPU hardware profile, create a data science project and then a workbench using the hardware profile you just created.

Wait until the workbench status is Running.

To verify where the workbench landed and what resources it requested, inspect the pod:
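The placeholders <project> and <workbench> below stand for your data science project and workbench names; OpenShift AI notebook pods are typically named <workbench>-0.

```shell
# Show which node the workbench pod landed on
oc get pod <workbench>-0 -n <project> -o wide

# Show its node selector, resource limits, and tolerations
oc get pod <workbench>-0 -n <project> \
  -o jsonpath='{.spec.nodeSelector}{"\n"}{.spec.containers[0].resources.limits}{"\n"}{.spec.tolerations}{"\n"}'
```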

At this stage, the workbench pod should:

  • run on the GPU node
  • request nvidia.com/gpu: "1"
  • use nodeSelector: nvidia.com/gpu.present: "true"
  • include a toleration for nvidia.com/gpu=true:NoSchedule

Finally, click the workbench, launch a terminal, and run `nvidia-smi` to confirm that the GPU is available. The output should show an NVIDIA Tesla T4, confirming that the workbench has end-to-end GPU access through OpenShift AI.

10. Cleanup

If you no longer need the GPU test resources, remove them after validation.

Delete the standalone validation pod:
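Assuming the test pod was named gpu-test:

```shell
oc delete pod gpu-test
```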

Delete the OpenShift AI workbench from the dashboard, or remove the workbench pod and project resources from the CLI as needed:
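From the CLI, deleting the Notebook custom resource removes the workbench pod as well (<project> and <workbench> are placeholders):

```shell
oc delete notebook <workbench> -n <project>
```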

If you created a dedicated OpenShift AI project only for this test, you can remove it:
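For example:

```shell
oc delete project <project>
```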

Delete the ClusterPolicy:
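For example:

```shell
oc delete clusterpolicy gpu-cluster-policy
```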

Delete the GPU Operator resources:
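A sketch assuming the Software Catalog installation from Option A (Helm users would run `helm uninstall gpu-operator -n nvidia-gpu-operator` instead):

```shell
# Remove the Subscription, the installed CSV, and the namespace
oc delete subscription gpu-operator-certified -n nvidia-gpu-operator
oc get csv -n nvidia-gpu-operator -o name | grep gpu-operator-certified \
  | xargs oc delete -n nvidia-gpu-operator
oc delete namespace nvidia-gpu-operator
```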

Delete the NFD resources:
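A sketch, assuming the resource names used in this guide's examples:

```shell
# Remove the NodeFeatureDiscovery instance, then the operator and namespace
oc delete nodefeaturediscovery nfd-instance -n openshift-nfd
oc delete subscription nfd -n openshift-nfd
oc get csv -n openshift-nfd -o name | grep nfd | xargs oc delete -n openshift-nfd
oc delete namespace openshift-nfd
```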

If you no longer need GPU worker capacity on the cluster, delete the GPU machine pool:
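Assuming the machine pool was named gpu-pool:

```shell
rosa delete machinepool gpu-pool --cluster <cluster-name>
```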

Verify that the GPU node has been removed:
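For example:

```shell
# Should return no resources once the GPU node is gone
oc get nodes -l nvidia.com/gpu.present=true
```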

If you enabled hardware profiles only for this validation and do not want to leave them exposed in the dashboard, you can revert the dashboard setting:
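Assuming the same OdhDashboardConfig name and namespace as before:

```shell
# Hide hardware profiles in the dashboard again
oc patch odhdashboardconfig odh-dashboard-config \
  -n redhat-ods-applications \
  --type merge \
  -p '{"spec":{"dashboardConfig":{"disableHardwareProfiles":true}}}'
```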

If you created a dedicated GPU hardware profile in the OpenShift AI dashboard, remove it from Settings -> Hardware profiles when it is no longer needed.
