Cloud Experts Documentation

ROSA with Nvidia GPU Workloads - Manual

This content is authored by Red Hat experts, but has not yet been tested on every supported configuration.

This guide shows how to set up GPU nodes on a ROSA cluster manually, as an alternative to our Helm chart guide.

Prerequisites

  • ROSA cluster (4.14+)
    • You can install a Classic cluster using the CLI or an HCP cluster using Terraform.
    • Make sure you are logged in to the cluster with cluster-admin access.
  • rosa CLI
  • oc CLI

1. Setting up GPU machine pools

In this tutorial, I’m using g5.4xlarge nodes for the GPU machine pool, with auto-scaling enabled up to 4 nodes. Please replace your-cluster-name with the name of your cluster.

Note that you can also use a different instance type, or a fixed number of nodes instead of auto-scaling.
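As a rough sketch of this step, the machine pool could be created with the rosa CLI along these lines. The pool name gpu-pool and the minimum of one replica are assumptions made for illustration; only the instance type and the maximum of 4 nodes come from this guide.

```shell
# Assumed values: replace your-cluster-name with your cluster's name.
# POOL_NAME and MIN_REPLICAS are illustrative choices, not values from this guide.
CLUSTER_NAME="your-cluster-name"
POOL_NAME="gpu-pool"
INSTANCE_TYPE="g5.4xlarge"
MIN_REPLICAS=1
MAX_REPLICAS=4

# Create an auto-scaling machine pool backed by GPU instances.
rosa create machinepool \
  --cluster="$CLUSTER_NAME" \
  --name="$POOL_NAME" \
  --instance-type="$INSTANCE_TYPE" \
  --enable-autoscaling \
  --min-replicas="$MIN_REPLICAS" \
  --max-replicas="$MAX_REPLICAS"

# List the machine pools to confirm that the new pool exists.
rosa list machinepools --cluster="$CLUSTER_NAME"
```

For a fixed-size pool, the three auto-scaling flags can be replaced with a single --replicas flag set to the desired node count.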

2. Installing NFD operator

The Node Feature Discovery (NFD) operator discovers the GPUs on your nodes, and the NFD instance labels those nodes so that you can target them for workloads. Please refer to the official OpenShift documentation for more details.
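One possible way to install the operator from the command line is to apply a Namespace, an OperatorGroup, and a Subscription, as sketched here. The openshift-nfd namespace, the object names, and the stable channel are assumptions based on the usual OperatorHub setup; check them against the official OpenShift documentation for your cluster version.

```shell
# Write the manifests to a file first so they can be reviewed before applying.
# Namespace, object names, and channel are assumptions (see the lead-in).
cat > nfd-operator.yaml <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nfd-operator-group
  namespace: openshift-nfd
spec:
  targetNamespaces:
  - openshift-nfd
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: stable
  installPlanApproval: Automatic
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

# Apply the manifests, then check the ClusterServiceVersion status.
oc apply -f nfd-operator.yaml
oc get csv -n openshift-nfd
```

The operator is ready once its ClusterServiceVersion reports the Succeeded phase.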

Note that the operator installation might take a few minutes. Next, we will create the NFD instance.
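The NFD instance is a NodeFeatureDiscovery custom resource. Because its operand image depends on the cluster version, the sketch here does not hard-code a manifest; instead it pulls the example resource that the operator ships in its alm-examples annotation. It assumes that the operator was installed through a Subscription named nfd in the openshift-nfd namespace, that the annotation contains a NodeFeatureDiscovery example, and that jq is installed (jq is not among the prerequisites above).

```shell
# Assumed names: adjust if the operator was installed differently.
NFD_NAMESPACE="openshift-nfd"
NFD_SUBSCRIPTION="nfd"

# Look up the ClusterServiceVersion installed by the Subscription.
NFD_CSV=$(oc get subscription "$NFD_SUBSCRIPTION" -n "$NFD_NAMESPACE" \
  -o jsonpath='{.status.installedCSV}')

# Extract the first NodeFeatureDiscovery example from the CSV's alm-examples.
oc get csv "$NFD_CSV" -n "$NFD_NAMESPACE" \
  -o jsonpath='{.metadata.annotations.alm-examples}' \
  | jq '[.[] | select(.kind == "NodeFeatureDiscovery")][0]' \
  > nfd-instance.json

# Review nfd-instance.json, then create the instance.
oc apply -n "$NFD_NAMESPACE" -f nfd-instance.json

# The NFD pods should appear shortly afterwards.
oc get pods -n "$NFD_NAMESPACE"
```

It is worth opening nfd-instance.json before applying it, since the example may carry settings you want to change for your environment.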

3. Installing GPU operator

Next, we will set up the NVIDIA GPU Operator, which manages the NVIDIA software components, and the ClusterPolicy object, which ensures the right setup for NVIDIA GPUs in the OpenShift environment. Please refer to the official NVIDIA documentation for more details.
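A minimal sketch of the operator installation follows. The nvidia-gpu-operator namespace, the OperatorGroup name, and the certified-operators catalog source are assumptions based on the usual OperatorHub setup; the channel is read from the package manifest rather than hard-coded.

```shell
# Assumed namespace; adjust if you install the operator elsewhere.
GPU_NAMESPACE="nvidia-gpu-operator"

# Ask the catalog for the operator's default channel.
CHANNEL=$(oc get packagemanifest gpu-operator-certified -n openshift-marketplace \
  -o jsonpath='{.status.defaultChannel}')

# Write Namespace, OperatorGroup, and Subscription to a file for review.
cat > gpu-operator.yaml <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: ${GPU_NAMESPACE}
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: ${GPU_NAMESPACE}
spec:
  targetNamespaces:
  - ${GPU_NAMESPACE}
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: ${GPU_NAMESPACE}
spec:
  channel: "${CHANNEL}"
  installPlanApproval: Automatic
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF

# Apply the manifests, then check the ClusterServiceVersion status.
oc apply -f gpu-operator.yaml
oc get csv -n "$GPU_NAMESPACE"
```

If CHANNEL comes back empty, the package manifest lookup failed and the Subscription should not be applied as written.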

Finally, let’s update the ClusterPolicy.
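One way to do this, sketched here, is to start from the default ClusterPolicy that the operator publishes in its alm-examples annotation and apply it. The sketch assumes a Subscription named gpu-operator-certified in the nvidia-gpu-operator namespace and that jq is installed; both names are assumptions rather than values from this guide.

```shell
# Assumed names: adjust if the operator was installed differently.
GPU_NAMESPACE="nvidia-gpu-operator"
GPU_SUBSCRIPTION="gpu-operator-certified"

# Look up the ClusterServiceVersion installed by the Subscription.
GPU_CSV=$(oc get subscription "$GPU_SUBSCRIPTION" -n "$GPU_NAMESPACE" \
  -o jsonpath='{.status.installedCSV}')

# Extract the first example resource (the default ClusterPolicy) from the CSV.
oc get csv "$GPU_CSV" -n "$GPU_NAMESPACE" \
  -o jsonpath='{.metadata.annotations.alm-examples}' \
  | jq '.[0]' > clusterpolicy.json

# Review or edit clusterpolicy.json, then create or update the ClusterPolicy.
oc apply -f clusterpolicy.json

# Watch the operator roll out its pods on the GPU nodes.
oc get pods -n "$GPU_NAMESPACE"
```

Any changes to the policy can be made in clusterpolicy.json before the apply step, or afterwards by editing the file and applying it again.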

Validating GPU (optional)

By now your GPU should be set up correctly. If you’d like to validate it, you can run a few checks from a terminal.

In essence, here we verify that NFD can detect the GPUs, run nvidia-smi on the GPU driver daemonset pod, run a simple CUDA vector addition test pod, and delete it.
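The checks described above could look roughly like the sketch here. The nvidia-gpu-operator namespace, the pod label used to find the driver daemonset pod, the pod name cuda-vectoradd, and the sample image tag are all assumptions; adjust them to your environment. The test pod is created in your current project.

```shell
# Assumed namespace where the GPU operator runs.
GPU_NAMESPACE="nvidia-gpu-operator"

# 1. Nodes that NFD labelled as having an NVIDIA PCI device (vendor ID 10de).
oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true

# 2. Run nvidia-smi inside one of the GPU driver daemonset pods.
#    The label selector is an assumption about how the driver pods are labelled.
DRIVER_POD=$(oc get pods -n "$GPU_NAMESPACE" -l openshift.driver-toolkit=true \
  -o jsonpath='{.items[0].metadata.name}')
oc exec -n "$GPU_NAMESPACE" "$DRIVER_POD" -- nvidia-smi

# 3. A simple CUDA vector addition pod that requests one GPU.
#    The image tag is an assumption; substitute a sample image you trust.
cat > cuda-vectoradd.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
oc apply -f cuda-vectoradd.yaml

# Wait for the pod to finish, print its output, then delete it.
oc wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-vectoradd --timeout=300s
oc logs pod/cuda-vectoradd
oc delete pod cuda-vectoradd
```

If the first command returns no nodes, the later steps will fail as well, so it is best to resolve that before continuing.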

Note that this validation step could take a few minutes to complete. If you see errors such as “No GPU nodes detected” or “Failed to run nvidia-smi”, wait a few minutes and try again.

© 2023 Red Hat, Inc.