In certain cases, OpenShift 4.x users may want to run large, distributed workloads even when no single bare metal machine in their cluster has hardware adequate for the job, and that hardware cannot or will not be upgraded for the time being. Taken together, however, the cluster's machines may well have the resources needed to run such workloads.

A workload well suited to this approach is one that uses TensorFlow, one of the most popular Machine Learning (ML) and Artificial Intelligence (AI) frameworks in use today, both by data scientists and by developers who want to assess the performance impact of their latest development work. Because TensorFlow is so widely applicable, this guide demonstrates how to run distributed TensorFlow GPU workloads on OpenShift 4.x in the cloud using AWS. The same process should also work with other cloud service providers (CSPs) and on bare metal clusters.

OpenShift 4.x Prerequisites

Launching an OpenShift 4.x Cluster

The first prerequisite of this two-part guide is having an OpenShift cluster up and running in AWS, GCP, or Azure, running the latest stable release of OCP 4.6 or later. You can install a cluster on bare metal by following this guide, or you can use the installation instructions for AWS, Azure, or GCP.

Launch at Least Two GPU MachineSets or Two GPU Nodes in your Cluster

Launch at least two GPU instances of your choosing. Each instance can have a single GPU if your aim is simply to test the distributed workload process. If you need help launching a GPU instance, you can create a MachineSet YAML file, or use another method of your choosing. An example of how to create a MachineSet can be found here. You can then follow this guide to learn how to scale the number of GPU machines with `oc scale`, so that you don't have to create a new MachineSet YAML every time you want to launch another node.
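For reference, a GPU MachineSet manifest on AWS looks roughly like the sketch below. All names, the instance type, and the region here are placeholders; copy the real values (AMI, subnet, security groups, and so on) from one of your cluster's existing MachineSets.

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: mycluster-abc12-gpu-us-east-1a     # placeholder name
  namespace: openshift-machine-api
spec:
  replicas: 2                              # number of GPU nodes to create
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machineset: mycluster-abc12-gpu-us-east-1a
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-machineset: mycluster-abc12-gpu-us-east-1a
    spec:
      providerSpec:
        value:
          instanceType: g4dn.xlarge        # a 1-GPU instance type
          placement:
            availabilityZone: us-east-1a   # placeholder
            region: us-east-1              # placeholder
```

Once the MachineSet exists, you can add or remove nodes with `oc scale machineset <name> -n openshift-machine-api --replicas=<count>` instead of writing a new manifest each time.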

Determine How you want to Install TensorFlow

In the next two sections (the first of which is entirely optional), I describe two ways to install TensorFlow 2.x. The first is to build the library yourself, which I recommend for those who want bleeding-edge TensorFlow or a build optimized for their specific hardware; the second is to install the library with pip3, which takes significantly less time.

[Optional Step] Preparation Steps for a Custom TensorFlow Build

Accessing a GPU Machine Outside your Cluster to Build a GPU-Accelerated TensorFlow Package

Before you begin the process of building the `tensorflow` pip package, ensure that you have access to an outside GPU machine with CUDA-capable GPUs (i.e., a machine with NVIDIA cards that does not belong to your cluster). This can be a non-OCP bare metal machine or a non-OCP instance in any supported cloud.

If you're using AWS or a similar CSP, I highly recommend launching your outside GPU instance with at least 24 CPUs. I used a `g4dn.16xlarge` AWS instance, for example, which has 64 vCPUs. Since the TensorFlow GPU build process partially involves using CPUs, you will want a large number of real cores to shorten the build time from potentially 6+ hours to a mere 1-3 hours. Even better, using a machine with multiple GPUs will significantly speed up the process.

Setting the Maximum Number of Files that can be Opened

On your build host, before you begin building the image, check your container runtime's ulimit cap on the number of open files. For Podman, for example, run:

$ cat /etc/containers/containers.conf

If this file doesn’t exist on your GPU machine, that’s perfectly okay. You can safely create this file later.

Note that the default open-file limit is typically 1024, whether `containers.conf` exists or not. Unfortunately, 1024 is too small for building TensorFlow, which needs more than 1024 files open at once, so you will want to set this value significantly higher. I chose 65535, since I was using a temporary AWS instance that I deleted after building the image. If you already have a `containers.conf` file and you're using Podman, add the following two lines to it if they aren't already present:

[containers]
default_ulimits = [ "nofile=65535:65535", ]

If the file doesn't exist, create it with the above content. If it does exist and `default_ulimits` is already set, change the `nofile` entry to 65535.

Build the TensorFlow 2.x GPU Image and Push it to Your Own Image Repo

First, clone the repository of blog artifacts (substituting the repository's URL):

$ export BLOG_ARTIFACTS=/tmp/blog-artifacts
$ git clone <repository-url> ${BLOG_ARTIFACTS}

Once you've cloned the repo, you will need to edit two files before building your image, whether you're installing TensorFlow from the official pip3 binary or using your own custom build: `cuda.repo.template` and `nvidia-ml.repo.template`, both under the `${BLOG_ARTIFACTS}/Dockerfiles` directory. Modify each file's `baseurl` to point at the appropriate CUDA repo URL.
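As an illustration, a filled-in CUDA repo file might look like the following. The exact `baseurl` depends on your OS and architecture, so treat this as a sketch rather than the canonical URL:

```
[cuda]
name=cuda
baseurl=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64
enabled=1
gpgcheck=1
```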

Next, choose which version of TensorFlow you want to build, along with the version of CUDA to use with it, by modifying the Dockerfile's CUDA-related environment variables. The default CUDA version is 11. If you want to use a different version, the supported CUDA configurations are listed in the TensorFlow documentation on building from source.
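The variable names below are illustrative only (check the actual Dockerfile for the names it uses); they show the kind of CUDA-related settings you would adjust:

```
ENV CUDA_VERSION=11.0 \
    CUDNN_VERSION=8.0 \
    TF_CUDA_COMPUTE_CAPABILITIES=7.5
```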

When ready, run the following commands:

$ export TF_IMAGE_TAG="<your-registry>/<your-repo>/tensorflow:2.x-gpu" # set to your image tag
$ export TF_DOCKERFILE="Dockerfile.ubi8-tf-pip3" # change if building custom TF
$ podman build -t ${TF_IMAGE_TAG} -f Dockerfiles/${TF_DOCKERFILE} .
$ podman login <your-registry>
$ podman push ${TF_IMAGE_TAG}

Build the Modified National Institute of Standards and Technology (MNIST) or Fashion MNIST Workload Image from your TensorFlow 2.x GPU Image

I have two workloads that can be used for assessing distributed TensorFlow GPU performance. The first workload is an instance of MNIST training, which is the process of training an algorithm to classify handwritten digits. (For more information on the MNIST dataset, you can visit this web page.) The second workload is an instance of the Fashion MNIST training, which, rather than classifying handwritten digits, aims to classify articles of clothing.

The MNIST Python script -- used to perform this MNIST classification -- is based upon this guide in the official TensorFlow Documentation, while the Fashion MNIST Python script is based upon this guide.

The MNIST and Fashion MNIST images both pull in your TensorFlow 2.x image and pull in the necessary Python 3 scripts for running the training. The reason for creating a separate TensorFlow image beforehand is to enable you to run TensorFlow using whichever benchmark you desire without having to rebuild TensorFlow every time you wish to use a different benchmark. Of course, you can optionally add and edit lines at the end of your TensorFlow Dockerfile that include your desired benchmark(s), but creating a separate TensorFlow image may allow you to organize your work better.
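To make the layering concrete, `Dockerfile.mnist` conceptually does something like the following sketch; the image name and script path here are placeholders, and the real Dockerfile in the repo is authoritative:

```
FROM <your-registry>/<your-tf-image>:latest
COPY mnist.py /workspace/mnist.py
CMD ["python3", "/workspace/mnist.py"]
```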

$ export DATASET_TAG="<your-registry>/<your-repo>/mnist:latest" # set to your image tag
$ export DATASET_DOCKERFILE="Dockerfile.mnist"
$ podman build -t ${DATASET_TAG} -f Dockerfiles/${DATASET_DOCKERFILE} .
$ podman push ${DATASET_TAG}

Create the distributed-tf Namespace

Now that you've built your relevant images, you're ready to create a namespace for your OCP work. I chose the name `distributed-tf`, but you can choose whichever name you'd like, as long as you update the files mentioned later on accordingly.

$ oc new-project distributed-tf

Install Node Feature Discovery, the NVIDIA GPU Operator, and Open Data Hub (or Kubeflow)

You will need to install Node Feature Discovery (NFD), the NVIDIA GPU Operator, and Open Data Hub / Kubeflow. Without these operators, you cannot run the distributed GPU workload. For more information on how to install these operators and for information on what these operators do, please visit this section of the documentation.

Run the Distributed Workload

The Open Data Hub Operator (or the Kubeflow Operator itself, since the ODH Operator is a fork of the Kubeflow Operator) installs a CRD called `TFJob`. This CRD allows us to dynamically set an environment variable called `TF_CONFIG`, which contains information about the nodes we plan to use for distributed learning: each worker's address, each worker's task role, and so on. Without this environment variable, the `TFJob` does not know which workers to use and thus cannot properly distribute the workload across your desired workers.
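For context, `TF_CONFIG` is a JSON document. For worker 0 of a three-worker job it looks something like the sketch below; the worker addresses mirror the service-name pattern you'll see in the pod logs later in this post, but the names generated in your cluster will differ:

```shell
# Illustrative TF_CONFIG for worker 0 of a three-worker TFJob (names are placeholders).
export TF_CONFIG='{"cluster":{"worker":["tfjob-multi-worker-0.distributed-tf.svc:2222","tfjob-multi-worker-1.distributed-tf.svc:2222","tfjob-multi-worker-2.distributed-tf.svc:2222"]},"task":{"type":"worker","index":0}}'

# Each worker reads TF_CONFIG to learn the full worker list, its own role, and its index:
python3 -c 'import json, os; cfg = json.loads(os.environ["TF_CONFIG"]); print(len(cfg["cluster"]["worker"]), cfg["task"]["type"], cfg["task"]["index"])'
```

With a `TFJob`, the operator injects this variable into each worker pod for you; you never set it by hand.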

To run either MNIST distributed workload, you will need to edit the `manifests/mnist-tfjob.yaml` file or the `manifests/fashion-mnist-tfjob.yaml` file, depending on which `TFJob` you wish to run. For example, for the MNIST distributed workload, you can set the number of workers ("replicas") you wish to use and set your image-related info. Once you're ready, run:

$ oc create -f manifests/mnist-tfjob.yaml
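For reference, the fields you would typically edit in `mnist-tfjob.yaml` look roughly like the sketch below; the image name is a placeholder, and the actual file in the repo is authoritative:

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  generateName: tfjob-multi-
  namespace: distributed-tf
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 3            # number of distributed workers
      template:
        spec:
          containers:
          - name: tensorflow
            image: <your-registry>/<your-mnist-image>:latest  # placeholder
            resources:
              limits:
                nvidia.com/gpu: 1    # one GPU per worker
```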

The `TFJob` itself may take a while to launch the first time because the TensorFlow base image that you created for your MNIST image is rather large.

When the `TFJob` is ready and running, you should see a number of pods equal to the number of workers you chose. You can view one of the pod’s logs after the training has completed. For example, here is the output from one of my TFJob pods:

$ oc get pods | grep tf-multi

$ oc logs pod/<one-of-your-outputted-tfjob-pod-names>
2021-02-12 18:42:45.473514: I tensorflow/core/common_runtime/gpu/] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:0 with 13501 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2021-02-12 18:42:45.483618: I tensorflow/core/distributed_runtime/rpc/] Initialize GrpcChannelCache for job worker -> {0 -> tfjob-multi-gh5w5-worker-0.distributed-tf.svc:2222, 1 -> tfjob-multi-gh5w5-worker-1.distributed-tf.svc:2222, 2 -> tfjob-multi-gh5w5-worker-2.distributed-tf.svc:2222}
2021-02-12 18:42:45.483973: I tensorflow/core/distributed_runtime/rpc/] Started server with target: grpc://tfjob-multi-gh5w5-worker-0.distributed-tf.svc:2222
Downloading data from
11493376/11490434 [==============================] - 0s 0us/step
2021-02-12 18:42:47.911498: W tensorflow/core/framework/] Allocation of 188160000 exceeds 10% of free system memory.
2021-02-12 18:42:48.456117: W tensorflow/core/framework/] Allocation of 188160000 exceeds 10% of free system memory.
2021-02-12 18:42:48.781824: W tensorflow/core/grappler/optimizers/data/] In AUTO-mode, and switching to DATA-based sharding, instead of FILE-based sharding as we cannot find appropriate reader dataset op(s) to shard. Error: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
op: "TensorSliceDataset"
input: "Placeholder/_0"
input: "Placeholder/_1"
attr {
 key: "Toutput_types"
 value {
   list {
     type: DT_FLOAT
     type: DT_INT64
attr {
 key: "output_shapes"
 value {
   list {
     shape {
       dim {
         size: 28
       dim {
         size: 28
     shape {
2021-02-12 18:42:48.808413: W tensorflow/core/framework/] Allocation of 188160000 exceeds 10% of free system memory.
2021-02-12 18:42:49.192055: I tensorflow/compiler/mlir/] None of the MLIR optimization passes are enabled (registered 2)
2021-02-12 18:42:49.192655: I tensorflow/core/platform/profile_utils/] CPU Frequency: 2499995000 Hz
2021-02-12 18:42:50.673844: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-02-12 18:42:51.453592: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-02-12 18:42:51.871819: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
Epoch 1/3
700/700 [==============================] - 54s 68ms/step - loss: 2.1137 - accuracy: 0.4549
Epoch 2/3
700/700 [==============================] - 46s 66ms/step - loss: 0.8979 - accuracy: 0.8299
Epoch 3/3
700/700 [==============================] - 46s 66ms/step - loss: 0.5035 - accuracy: 0.8741

I used three GPU machines, with one worker per pod, which is why we see this part of the output:

Initialize GrpcChannelCache for job worker -> {0 -> tfjob-multi-gh5w5-worker-0.distributed-tf.svc:2222, 1 -> tfjob-multi-gh5w5-worker-1.distributed-tf.svc:2222, 2 -> tfjob-multi-gh5w5-worker-2.distributed-tf.svc:2222}

We can also see that the CUDA libraries are loaded successfully, and each epoch's loss, accuracy, and step time are shown.

