
Note: The following procedure can also be used to deploy the NVIDIA GPU Operator, since it has the same prerequisites as the Special Resource Operator (SRO). Docs are here.

The job of the Performance and Latency Sensitive Applications (PSAP) team at Red Hat is optimizing Red Hat OpenShift, the industry’s most comprehensive enterprise Kubernetes platform, to run compute-intensive enterprise workloads and HPC applications effectively and efficiently. As a team of Linux and performance enthusiasts who are always pushing the limits of what is possible with the latest and greatest upstream technologies, we are operating at the forefront of innovation with compelling proof-of-concept (POC) implementations and advanced deployment scenarios.  

Overview

Driver containers are a novel way of including device-specific kernel modules (kmods) within an OCI container. Since these kmods depend closely on the kernel version (and kernel headers), they need to be (re)compiled on the target host. The Special Resource Operator (SRO for short) was designed for this purpose.

However, the SRO needs access to RHEL source code from the target host. While this is fully automated in environments that can reach the internet, and therefore the RHEL source code, disconnected environments require some additional configuration.

This blog post details the deployment of SRO/driver containers in disconnected environments (both fully disconnected and behind a proxy).

Prerequisites

You must have access to the internet to obtain the data that populates the mirror repository. In this procedure, you will place the mirror registry on a bastion host that has access to both your network and the internet. If you do not have access to a bastion host, use the method that best fits your restrictions to bring the contents of the mirror registry into your restricted network. You must also have a Red Hat Enterprise Linux (RHEL) server on your network to use as the registry host. The registry host MUST be able to access the internet, or at least the URLs mentioned throughout this guide.

The cluster must be properly configured and entitled as seen in:

Part 1 - Setting the Mirror Registry and OLM Catalog

Procedure

[Bastion host]

Step 1: Create a Mirror Registry

Follow installation-creating-mirror-registry_samples-operator-alt-registry.

Note: You must ensure that your registry hostname is registered in the same DNS and that it resolves to the expected IP address. Otherwise, image pulls will fail with an x509 error, because the certificate is issued for the registry hostname and not for a public name.

Step 2: Authenticate the Mirror Registry

[Bastion host/Local host]

Now, let’s allow our cluster to reference images from the mirror registry we just built.

Follow installation-adding-registry-pull-secret_samples-operator-alt-registry.  

[Optional] To authenticate your mirror registry, you need to configure additional trust stores for image registry access in your OCP cluster. You can create a ConfigMap in the openshift-config namespace and reference its name in additionalTrustedCA in the image.config.openshift.io resource. This provides additional CAs that should be trusted when contacting external registries.

The ConfigMap key is the hostname and port of a registry for which this CA is to be trusted, and the value is the base64-encoded certificate, with one entry for each additional registry CA to trust.

You can configure additional CAs with the following procedure:

bash
$ oc create configmap registry-config --from-file=<external_registry_address>=ca.crt -n openshift-config
$ oc edit image.config.openshift.io cluster
spec:
  additionalTrustedCA:
    name: registry-config

Note: If your <external_registry_address> includes a port such as ':5000', write it as '..5000' in the ConfigMap key to avoid this error:

bash
error: "xxxxxxxxxx::5000" is not a valid key name for a ConfigMap: a valid config key must consist of alphanumeric characters, '-', '_' or '.' (e.g. 'key.name',  or 'KEY_NAME',  or 'key-name', regex used for validation is '[-._a-zA-Z0-9]+')
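
The key rewrite described in the note can be scripted. The following is a minimal sketch rather than part of the original procedure: `registry_configmap_key` is a hypothetical helper name, and `registry.example.com:5000` is a placeholder address.

```shell
# Hypothetical helper (not part of oc): rewrite "host:port" as "host..port",
# the form that is accepted as a ConfigMap key for a registry with a port.
registry_configmap_key() {
  printf '%s\n' "$1" | sed 's/:/../'
}

# Placeholder registry address, used only for illustration.
REGISTRY="registry.example.com:5000"
KEY="$(registry_configmap_key "$REGISTRY")"
echo "$KEY"

# The key can then be used in the command shown earlier, for example:
#   oc create configmap registry-config --from-file="${KEY}=ca.crt" -n openshift-config
```

An address without a port contains no colon and is returned unchanged.
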

Step 3: Building an Operator Catalog Image

  1. Follow Building an Operator catalog image 

  2. Follow Mirroring the OpenShift Container Platform image repository 

Note: For now, we need to specify the architecture we want to mirror into the registry when using the oc CLI. To do this in both steps, pass the flag --filter-by-os='linux/amd64':

oc adm catalog build --filter-by-os='linux/amd64' ….
oc adm catalog mirror --filter-by-os='linux/amd64' ….

This prevents a known error caused by the Docker registry not supporting multi-architecture manifests.

[Optional] Mirror Images for Helm Deployment

After deploying the mirror image registry in step 2:

Mirror the images listed at: https://github.com/NVIDIA/gpu-operator/blob/master/bundle/manifests/gpu-operator.clusterserviceversion.yaml#L128 

yaml
relatedImages:
   - name: gpu-operator-image
     image: nvcr.io/nvidia/gpu-operator@sha256:1a1c95d392ea2c055b09c9d074ab4d577a42d5d338109234d7a868bf2ebdfa8d
   - name: dcgm-exporter-image
     image: nvcr.io/nvidia/k8s/dcgm-exporter@sha256:85016e39f73749ef9769a083ceb849cae80c31c5a7f22485b3ba4aa590ec7b88
   - name: container-toolkit-image
     image: nvcr.io/nvidia/k8s/container-toolkit@sha256:b3f48033d7d9e1d5703b6ecffe35d219a45a17bdcf85374d78924dee9c8917be
   - name: driver-image
     image: nvcr.io/nvidia/driver@sha256:324e9dc265dec320207206aa94226b0c8735fd93ce19b36a415478c95826d934
   - name: device-plugin-image
     image: nvcr.io/nvidia/k8s-device-plugin@sha256:45b459c59d13a1ebf37260a33c4498046d4ade7cc243f2ed71115cd81054cd85
   - name: gpu-feature-discovery-image
     image: nvcr.io/nvidia/gpu-feature-discovery@sha256:82e6f61b715d710c60ac14be78953336ea5dbc712244beb51036139d1cc8d526
   - name: cuda-sample-image
     image: nvcr.io/nvidia/k8s/cuda-sample@sha256:2a30fe7e23067bc2c3f8f62a6867702a016af2b80b9f6ce861f3fea4dfd85bc2
   - name: dcgm-init-container-image
     image: nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59

Then follow this guide: https://docs.openshift.com/container-platform/4.6/openshift_images/image-configuration.html to configure the `registrySources` of OpenShift to pull those images from the mirror registry.
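
To mirror the images from the relatedImages list, each source reference needs a destination in the mirror registry. The sketch below shows one possible way to derive it. It is an illustration under assumptions: `mirror_mapping` is a hypothetical helper, `registry.example.com:5000` is a placeholder for your mirror registry, and keeping the original repository path in the mirror is a design choice, not a requirement.

```shell
# Placeholder mirror registry address (an assumption; replace with your own).
MIRROR_REGISTRY="registry.example.com:5000"

# Hypothetical helper: print a "SRC=DST" mapping line for one image reference.
mirror_mapping() {
  src="$1"
  repo="${src#*/}"     # strip the source registry host, e.g. "nvcr.io/"
  repo="${repo%%@*}"   # strip the "@sha256:..." digest
  printf '%s=%s/%s\n' "$src" "$MIRROR_REGISTRY" "$repo"
}

# One image taken from the relatedImages list above.
mirror_mapping "nvcr.io/nvidia/driver@sha256:324e9dc265dec320207206aa94226b0c8735fd93ce19b36a415478c95826d934"
```

Collecting one such line per image into a file gives a mapping that a mirroring tool can consume, for example `oc image mirror -f <file>`. Check the documentation of the tool you use to see whether the destination needs an explicit tag.
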

Part 2 - Setting the YUM Mirror and Driver Container 

Note: Part 2 is only needed for SRO or the NVIDIA GPU Operator; the NFD operator does not need this step.

For setting up a YUM mirror, we can choose to use Red Hat Satellite or create a custom-made mirror following.

The packages we need to host in our mirror are:

  • elfutils-libelf.${HOST_ARCH} 
  • elfutils-libelf-devel.${HOST_ARCH}
  • kernel-headers-${GPU_NODE_KERNEL_VERSION}
  • kernel-devel-${GPU_NODE_KERNEL_VERSION}
  • kernel-core-${GPU_NODE_KERNEL_VERSION}

These packages are needed to run the driver container, as can be seen at: https://gitlab.com/nvidia/container-images/driver/-/blob/master/rhel8/nvidia-driver .

Note: You can get the $HOST_ARCH and $GPU_NODE_KERNEL_VERSION from `oc describe node` on one of the nodes.
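
As an illustration, the package names above can be expanded from a single kernel version string. This is a sketch under one assumption: the kernel version reported for the node ends with the architecture (as in the illustrative value `4.18.0-193.el8.x86_64`), so that HOST_ARCH can be taken from its last dot-separated field. `driver_packages` is a hypothetical helper name.

```shell
# Hypothetical helper: print the package names listed above for a kernel version.
# Assumption: the version string ends with the architecture,
# as in "4.18.0-193.el8.x86_64" (an illustrative value).
driver_packages() {
  GPU_NODE_KERNEL_VERSION="$1"
  HOST_ARCH="${GPU_NODE_KERNEL_VERSION##*.}"   # text after the last dot
  printf '%s\n' \
    "elfutils-libelf.${HOST_ARCH}" \
    "elfutils-libelf-devel.${HOST_ARCH}" \
    "kernel-headers-${GPU_NODE_KERNEL_VERSION}" \
    "kernel-devel-${GPU_NODE_KERNEL_VERSION}" \
    "kernel-core-${GPU_NODE_KERNEL_VERSION}"
}

driver_packages "4.18.0-193.el8.x86_64"

# On a live cluster the kernel version could be read with, for example:
#   oc get node <node_name> -o jsonpath='{.status.nodeInfo.kernelVersion}'
```
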

With the YUM-mirror in place, the next step is to add the repository configuration to the driver container: 

1. First, we create a ConfigMap containing the repository configuration file (my_mirror.repo):

bash
oc create configmap yum-repos-d --from-file /path/to/my_mirror.repo

2. Add the mirror repository to the operator buildConfig. For SRO this information must be added to: https://github.com/openshift-psap/special-resource-operator/blob/master/config/recipes/nvidia-gpu/manifests/1000-state-driver.yaml 

and:
https://github.com/openshift-psap/special-resource-operator/blob/master/config/recipes/nvidia-gpu/manifests/0000-state-driver-buildconfig.yaml 
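
For reference, the my_mirror.repo file used in step 1 could look roughly like the one generated below. Every value in it is a placeholder: the repository ID, the name, and the baseurl are assumptions and must be replaced with the details of your own mirror. Signature checking is disabled here only to keep the sketch short.

```shell
# Write an illustrative my_mirror.repo into the current directory.
# The repository ID, name, and baseurl are placeholders (assumptions).
cat > my_mirror.repo <<'EOF'
[my-mirror]
name=Local YUM mirror for driver container builds
baseurl=http://mirror.example.com/rhel8/
enabled=1
gpgcheck=0
EOF

# Show the resulting file.
cat my_mirror.repo
```
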

For the NVIDIA GPU Operator v1.4 and above (currently 1.5.2), use the steps below. For versions before 1.4, follow the same instructions as for SRO:

1. Create a configmap with custom repo list:

bash
oc create configmap repo-config -n gpu-operator-resources --from-file /path/to/my_mirror.repo

2. Specify repoConfig in values.yaml (if deploying with Helm):

yaml
driver:
 repository: nvcr.io/nvidia
 image: driver
 version: "450.80.02"
 repoConfig:
   configMapName: repo-config
   destinationDir: /etc/yum.repos.d

Alternatively, edit the driver.repoConfig entry in the ClusterPolicy CR.

3. Deploy the operator with Helm.

4. Verify that the ConfigMap is mounted successfully in the driver container.
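
One way to perform the check in step 4 is to list the destination directory inside the driver pod. The helper below is a sketch, not an official procedure: `repo_file_mounted` is a hypothetical name, the pod name must be supplied by you, and it assumes the driver container is the pod's default container and provides `ls`.

```shell
# Hypothetical helper: succeed if the repo file is visible inside the driver pod.
# Arguments: <driver_pod_name> <repo_file_name>
# Namespace and directory are the ones used in the steps above.
repo_file_mounted() {
  oc exec -n gpu-operator-resources "$1" -- ls /etc/yum.repos.d | grep -qxF "$2"
}

# Example usage against a live cluster (the pod name is a placeholder):
#   repo_file_mounted <driver_pod_name> my_mirror.repo && echo "repo file is mounted"
```
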

Now you are ready to deploy the SRO / GPU Operator to your disconnected OCP cluster.

We believe that Linux containers and container orchestration engines, most notably Kubernetes, are well positioned to power future software applications spanning multiple industries and verticals. Red Hat has embarked on a mission to enable some of the most critical workloads, like machine learning, deep learning, artificial intelligence, big data analytics, high-performance computing, and telecommunications, with Red Hat OpenShift. The PSAP team is supporting this mission across multiple footprints (public, private, and hybrid cloud), industries, and application types.

Troubleshooting

  • The documentation does not always mention it, but it is a good idea to start by deploying a medium-sized instance to host the registry.

Relevant links