Cloud Experts Documentation

ROSA with Nvidia GPU Workloads

This content is authored by Red Hat experts, but has not yet been tested on every supported configuration.

ROSA guide to running Nvidia GPU workloads.

Prerequisites

  • ROSA cluster (4.14+)
  • rosa CLI (logged in)
  • oc CLI (logged in as cluster-admin)
  • jq

If you need to install a ROSA cluster, please read our ROSA Quickstart Guide, or better yet, Use Terraform to create an HCP Cluster.

Enter the oc login command, username, and password from the output of the previous command:
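For example (the API URL, username, and password below are placeholders; use the values printed by your own admin-creation command):

```shell
# Placeholder values; substitute the URL and credentials from your own output.
oc login https://api.my-rosa-cluster.abcd.p1.openshiftapps.com:6443 \
  --username cluster-admin \
  --password "<password-from-output>"
```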


Helm Prerequisites

If you do not want to use Helm you can follow the steps in the Manual section.

  1. Add the MOBB chart repository to your Helm

  2. Update your repositories
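The two steps above look like this (the repository name and URL follow the rh-mobb/helm-charts project convention; verify them against the current chart repository):

```shell
# Add the MOBB chart repository (URL assumed from the rh-mobb/helm-charts project)
helm repo add mobb https://rh-mobb.github.io/helm-charts/

# Refresh the local chart index
helm repo update
```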

GPU Quota

  1. View the list of supported GPU instance types in ROSA
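Recent versions of the rosa CLI can list supported instance types directly; filtering for the common GPU instance families with grep is an assumption about the output format:

```shell
# List instance types known to ROSA and filter for common GPU families
rosa list instance-types | grep -E 'g4|g5|p3|p4'
```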

  2. Select a GPU instance type

    The guide uses g5.xlarge as an example. Please be mindful of the GPU cost of the type you choose.

  3. Login to AWS

    Log in to the AWS Console, type “quotas” into the search bar, and click “Service Quotas” -> “AWS services” -> “Amazon Elastic Compute Cloud (Amazon EC2)”. Search for “Running On-Demand [instance-family] instances” (e.g. Running On-Demand G and VT instances).

    Please remember that AWS grants this quota in vCPUs, not instances. For example, to run a single g5.xlarge (4 vCPUs) you will need a quota of 4; to run a single g5.8xlarge (32 vCPUs) you will need a quota of 32.
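The arithmetic is simple: multiply the vCPU count of the instance type by the number of instances you want. A minimal sketch:

```shell
# g5.xlarge has 4 vCPUs; quota is granted in vCPUs, not instances
VCPUS_PER_INSTANCE=4
DESIRED_INSTANCES=2
REQUIRED_QUOTA=$((VCPUS_PER_INSTANCE * DESIRED_INSTANCES))
echo "Request at least ${REQUIRED_QUOTA} vCPUs of quota"
```

For a single g5.8xlarge (32 vCPUs), the same math gives a required quota of 32.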

  4. Verify quota and request increase if necessary

    GPU Quota Request on AWS

GPU Machine Pool

  1. Set environment variables
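The variable names below are assumptions; adjust the values to match your cluster and the instance type you selected:

```shell
export CLUSTER_NAME=my-rosa-cluster   # your ROSA cluster name
export MACHINE_POOL_NAME=gpu-pool     # name for the new machine pool (assumed)
export GPU_INSTANCE_TYPE=g5.xlarge    # instance type chosen above
export GPU_REPLICAS=1                 # number of GPU nodes to create
```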

  2. Create GPU machine pool
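Using the environment variables set in the previous step (names assumed), the machine pool can be created with the rosa CLI:

```shell
# Create a machine pool of GPU instances (variable names assumed from step 1)
rosa create machinepool --cluster "$CLUSTER_NAME" \
  --name "$MACHINE_POOL_NAME" \
  --instance-type "$GPU_INSTANCE_TYPE" \
  --replicas "$GPU_REPLICAS"
```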

  3. Verify GPU machine pool

    It may take 10-15 minutes to provision a new GPU machine. If this step fails, please log in to the AWS Console and ensure you didn’t run into capacity or availability issues. You can go to EC2 and search for instances by cluster name to see the instance state.

  4. Double check that the cluster shows the node as ready
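One way to check is to filter nodes by the standard Kubernetes instance-type label (substitute the type you chose):

```shell
# List nodes of the chosen GPU instance type and confirm STATUS is Ready
oc get nodes -l node.kubernetes.io/instance-type=g5.xlarge
```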

Install and Configure Nvidia GPU

This section configures the Node Feature Discovery Operator (to allow OpenShift to discover the GPU nodes) and the Nvidia GPU Operator.

  1. Create namespaces
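The conventional namespaces for the two operators are openshift-nfd and nvidia-gpu-operator (these match the operators' usual defaults; adjust if your charts expect different names):

```shell
oc create namespace openshift-nfd
oc create namespace nvidia-gpu-operator
```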

  2. Use the mobb/operatorhub chart to deploy the needed operators
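A sketch of the Helm invocation (the release name and values layout are assumptions; consult the mobb/operatorhub chart for the exact values schema that subscribes to the Node Feature Discovery and Nvidia GPU operators):

```shell
helm upgrade --install -n nvidia-gpu-operator operators mobb/operatorhub \
  --values operatorhub.yaml   # values file listing the two operator subscriptions
```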

  3. Wait until the two operators are running

    Note: If you see an error like Error from server (NotFound): deployments.apps "nfd-controller-manager" not found, wait a few minutes and try again.
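The two operator deployments can be watched with oc rollout status (nfd-controller-manager is the deployment named in the note above; gpu-operator is the usual name of the Nvidia operator deployment and is an assumption here):

```shell
oc rollout status deploy/nfd-controller-manager -n openshift-nfd --timeout=300s
oc rollout status deploy/gpu-operator -n nvidia-gpu-operator --timeout=300s
```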

  4. Install the Nvidia GPU Operator chart
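A hedged sketch of the install (the chart and release names are assumptions; check the MOBB chart repository for the current chart that creates the NFD instance and the Nvidia ClusterPolicy):

```shell
helm upgrade --install -n nvidia-gpu-operator nvidia-gpu mobb/nvidia-gpu
```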

  5. Wait until NFD instances are ready

  6. Wait until Cluster Policy is ready

    Note: This step may take a few minutes to complete.
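Both conditions can be polled with oc wait (the app=nfd-worker label and the gpu-cluster-policy resource name are the operators' usual defaults and may differ in your installation):

```shell
# Wait for the NFD worker pods to be ready
oc wait --for=condition=Ready pod -l app=nfd-worker -n openshift-nfd --timeout=300s

# Wait for the Nvidia ClusterPolicy to report ready (may take several minutes)
oc wait clusterpolicy/gpu-cluster-policy -n nvidia-gpu-operator \
  --for=jsonpath='{.status.state}'=ready --timeout=600s
```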

Validate GPU

  1. Verify NFD can see your GPU(s)

    You should see output like:
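One way to check is to look for Nvidia's PCI vendor ID (10de) among the labels NFD applied to the node (the node selector below is an assumption; substitute your GPU instance type):

```shell
# NFD labels nodes with detected PCI devices; 10de is Nvidia's vendor ID
oc describe node -l node.kubernetes.io/instance-type=g5.xlarge \
  | grep 'feature.node.kubernetes.io/pci-10de'
```

A matching line such as feature.node.kubernetes.io/pci-10de.present=true indicates NFD detected the GPU.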

  2. Verify GPU Operator added node label to your GPU nodes
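A quick check using the nvidia.com/gpu.present label the GPU Operator applies (the exact label set varies by operator version):

```shell
oc get nodes -l nvidia.com/gpu.present=true
```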

  3. [Optional] Test GPU access using Nvidia SMI

    You should see output that shows the GPUs available on the host, such as in this example screenshot (varies depending on the GPU worker type).

    Nvidia SMI
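One way to run nvidia-smi is to exec into the driver daemonset pod (nvidia-driver-daemonset is the GPU Operator's usual daemonset name and is an assumption here):

```shell
oc exec -n nvidia-gpu-operator ds/nvidia-driver-daemonset -- nvidia-smi
```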
  4. Create Pod to run a GPU workload
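A sketch of the test pod, requesting one GPU via the nvidia.com/gpu resource (the image tag is an assumption; use a CUDA vectorAdd sample image that matches your cluster's driver and CUDA version):

```shell
cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
  namespace: default
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```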

  5. View logs
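Once the pod has completed:

```shell
oc logs cuda-vector-add
```

On success, the CUDA vectorAdd sample typically ends its output with "Test PASSED".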

    Please note: if you get the error “Error from server (BadRequest): container "cuda-vector-add" in pod "cuda-vector-add" is waiting to start: ContainerCreating”, try running oc delete pod cuda-vector-add and then re-run the create command above. We have seen cases where, if this step is run before the operators have finished reconciling, the pod can sit in that state indefinitely.

    You should see output like the following (may vary depending on GPU):

  6. If successful, the pod can be deleted
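The test pod can be removed with:

```shell
oc delete pod cuda-vector-add
```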

