Training a lightweight text classification model on Red Hat OpenShift AI on ROSA and storing model artifacts in Amazon S3
This content is authored by Red Hat experts, but has not yet been tested on every supported configuration. This guide has been validated on OpenShift 4.21. Operator CRD names, API versions, and console paths may differ on other versions.
Red Hat OpenShift AI has evolved significantly in recent releases. In this walkthrough, we install OpenShift AI on ROSA, disable unnecessary KServe-related dependencies for a notebook-only workflow, create an Amazon S3 connection, launch a workbench, train a lightweight text classification model, and upload the resulting model artifacts to Amazon S3.
Note that this article does not cover model serving. Starting with Red Hat OpenShift AI 2.25, the KServe Serverless deployment mode is deprecated. KServe RawDeployment remains available and uses fewer dependencies.
0. Prerequisites
Before you start, make sure you have:
- a ROSA cluster with cluster-admin level access
- the OpenShift CLI (oc) installed
- a default dynamic storage class on the cluster (typically already present on ROSA)
- an Amazon S3 bucket and AWS credentials, or permission to create them
- a machine pool with enough worker capacity for OpenShift AI
In our validation, a 2-worker m5.xlarge setup was not enough for this walkthrough because some OpenShift AI components could not be scheduled and the dashboard stayed in a Not Ready state. For this reason, plan for at least 3 worker nodes for OpenShift AI, enable autoscaling, or use a dedicated machine pool if your existing workers are already heavily used.
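If you decide to add a dedicated machine pool, a sketch with the rosa CLI looks like the following. The pool name and instance type are illustrative choices, not requirements; replace the cluster name placeholder with your own.

```shell
# Hypothetical example: add a dedicated 3-node machine pool for OpenShift AI.
# The pool name and instance type are illustrative; size to your workload.
rosa create machinepool \
  --cluster <cluster-name> \
  --name rhoai-workers \
  --instance-type m5.2xlarge \
  --replicas 3
```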
Verify that your cluster has healthy worker nodes and a default storage class:
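A quick check from the CLI:

```shell
# Worker nodes should report Ready
oc get nodes -l node-role.kubernetes.io/worker

# One storage class should be annotated "(default)",
# for example gp3-csi on ROSA
oc get storageclass
```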
1. Install the Red Hat OpenShift AI Operator
From the OpenShift web console:
- Go to Ecosystem -> Software Catalog (formerly known as OperatorHub)
- Search for Red Hat OpenShift AI
- Install the operator
- Create a DataScienceCluster when prompted (if it throws an error, wait a few minutes, refresh the page, and try again)
Alternatively, create the DataScienceCluster from the CLI:
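A minimal manifest is sketched below. The resource name and the exact set of component fields are assumptions; as noted earlier, CRD field names can differ between OpenShift AI versions, so compare against oc explain datasciencecluster.spec.components on your cluster.

```shell
# Minimal DataScienceCluster for a notebook-only workflow (illustrative)
cat <<'EOF' | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    dashboard:
      managementState: Managed
    workbenches:
      managementState: Managed
    kserve:
      managementState: Removed
    modelmeshserving:
      managementState: Removed
EOF
```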
On recent OpenShift AI releases, the default installation can stall before the DataScienceCluster becomes ready because it expects Service Mesh-related components, even though this walkthrough only uses workbenches and Amazon S3.
2. Remove the unnecessary serving dependencies
This guide does not use model serving, so remove the unnecessary serving-related dependencies first.
Patch the DSCI to remove Service Mesh:
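A sketch of the patch, assuming the default DSCInitialization is named default-dsci (confirm with oc get dscinitialization; field names may vary across versions):

```shell
# Tell the operator not to manage Service Mesh integration
oc patch dscinitialization default-dsci --type merge \
  -p '{"spec":{"serviceMesh":{"managementState":"Removed"}}}'
```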
Then patch the DSC to remove KServe:
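Assuming the DataScienceCluster is named default-dsc (confirm with oc get datasciencecluster), the equivalent patch is:

```shell
# Disable the KServe component for this notebook-only workflow
oc patch datasciencecluster default-dsc --type merge \
  -p '{"spec":{"components":{"kserve":{"managementState":"Removed"}}}}'
```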
Verify that the cluster becomes ready; the DataScienceCluster status should eventually report a Ready phase.
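A minimal check, again assuming the DataScienceCluster is named default-dsc (the status layout can vary slightly by version):

```shell
# Poll the DataScienceCluster phase; repeat until it reports Ready
oc get datasciencecluster default-dsc -o jsonpath='{.status.phase}'
```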
3. Create the Amazon S3 bucket
Create an S3 bucket in the AWS console or by using the AWS CLI. If you already have an S3 bucket and credentials available, you can skip this step and use your existing bucket instead.
For this walkthrough, I used rhoai-test-s3-bucket in ca-central-1, the same AWS region as the ROSA cluster. Using the same region keeps the example simple and avoids unnecessary cross-region configuration.
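From the AWS CLI, bucket creation looks like this. Note that for any region other than us-east-1, a LocationConstraint is required:

```shell
# Create the bucket in the same region as the ROSA cluster
aws s3api create-bucket \
  --bucket rhoai-test-s3-bucket \
  --region ca-central-1 \
  --create-bucket-configuration LocationConstraint=ca-central-1
```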
The regional Amazon S3 endpoint for ca-central-1 is https://s3.ca-central-1.amazonaws.com.
4. Create a data science project
After OpenShift AI is installed and the DataScienceCluster is ready, open the OpenShift AI dashboard.
- Click Data science projects
- Click Create project
- Enter a project name, for example project-s3
- Optionally add a description
- Click Create
5. Create a workbench and add the S3 connection
Inside the project you just created:
- Open the Workbenches tab
- Click Create workbench
- Enter a workbench name
- Select a Jupyter image
- Choose a modest size for this CPU-only validation
- In the Connections section, click Create connection
- Choose S3 compatible object storage - v1
- Enter the connection details (the values used in this walkthrough are listed below)
- Click Create to create the connection
- Click Create workbench
- Wait for the workbench to become Running
- Click the workbench name to launch it in a new tab
For AWS S3 in ca-central-1, the values used in this walkthrough were:
- Connection name: rhoai-test-s3-connection
- Endpoint: https://s3.ca-central-1.amazonaws.com
- Region: ca-central-1
- Bucket: rhoai-test-s3-bucket
You will also need the AWS access key and secret key for the bucket.
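Once the workbench starts with the connection attached, the connection details are exposed to the pod as environment variables. The variable names below follow the usual OpenShift AI S3 connection convention, but treat them as an assumption and verify them in your own workbench. A quick notebook cell to confirm they are present:

```python
import os

# Environment variables an OpenShift AI S3 connection typically injects
# into the workbench pod (verify the exact names in your environment).
S3_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_S3_ENDPOINT",
    "AWS_DEFAULT_REGION",
    "AWS_S3_BUCKET",
)

def connection_status(names=S3_VARS):
    """Return {variable_name: True/False} for each expected variable."""
    return {name: bool(os.environ.get(name)) for name in names}

for name, present in connection_status().items():
    print(f"{name}: {'set' if present else 'MISSING'}")
```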
6. Install the Python packages
Open a notebook cell in the workbench and run:
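The exact package set below is an assumption for this walkthrough; if your workbench image does not already include PyTorch, add torch to the list.

```shell
pip install transformers datasets accelerate boto3
```

In a Jupyter cell, prefix the command with ! or use the %pip magic.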
After the installation completes, restart the kernel before continuing.
7. Train a lightweight model and upload the artifacts to S3
In this walkthrough, we use distilbert-base-uncased, which is a safe and mainstream choice for a lightweight demo. This example also keeps the dataset intentionally small and uses one epoch so the walkthrough finishes in a reasonable time on CPU.
Before running the next cell, set the AWS_S3_BUCKET environment variable in your workbench if you want to use a bucket name other than the example default.
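The cell below is one possible shape for the training and upload step, consistent with the output described in the next section. The dataset (IMDB), sample counts, and hyperparameters are illustrative assumptions, the S3 credentials are read from the connection's environment variables, and TrainingArguments option names can differ slightly between transformers versions (eval_strategy was previously evaluation_strategy).

```python
import os
import boto3
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "distilbert-base-uncased"
BUCKET = os.environ.get("AWS_S3_BUCKET", "rhoai-test-s3-bucket")

# Small slices keep CPU training short; the dataset and sizes are
# illustrative assumptions for this walkthrough
raw = load_dataset("imdb")
train_ds = raw["train"].shuffle(seed=42).select(range(200))
eval_ds = raw["test"].shuffle(seed=42).select(range(50))

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

train_ds = train_ds.map(tokenize, batched=True)
eval_ds = eval_ds.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

args = TrainingArguments(
    output_dir="./model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    eval_strategy="epoch",   # evaluation_strategy on older transformers
    report_to="none",
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=eval_ds, compute_metrics=accuracy)
trainer.train()

# Save the artifacts locally, then upload every file to the model/ prefix
trainer.save_model("./model")
tokenizer.save_pretrained("./model")

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ.get("AWS_S3_ENDPOINT"),
    aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
    region_name=os.environ.get("AWS_DEFAULT_REGION"),
)
for root, _dirs, files in os.walk("./model"):
    for name in files:
        path = os.path.join(root, name)
        key = "model/" + os.path.relpath(path, "./model")
        s3.upload_file(path, BUCKET, key)
print(f"Uploaded files from ./model to s3://{BUCKET}/model/")
```

With 200 training samples and a batch size of 32, one epoch is 7 optimizer steps, which matches the [7/7] progress line discussed in the next section.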
If training feels slow, feel free to reduce the dataset further:
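For example, assuming the raw IMDB splits were loaded with load_dataset("imdb") into a variable named raw (a hypothetical name), you can shrink the slices further:

```python
# Hypothetical smaller slices; adjust the counts to taste
train_ds = raw["train"].shuffle(seed=42).select(range(64))
eval_ds = raw["test"].shuffle(seed=42).select(range(16))
```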
8. Understanding the notebook output
Your notebook output may include warnings and progress messages like those described below.
Note that some of the notebook output is expected and does not indicate a failure.
The Hugging Face warning simply means the model is being downloaded without an authentication token. The UNEXPECTED and MISSING entries in the DistilBERT load report are also expected when loading a base pretrained model for a sequence classification task.
The tokenization progress confirms that the dataset was processed successfully, and the pin_memory warning only indicates that the workbench is using CPU rather than GPU.
A line such as [7/7 05:01, Epoch 1/1] shows that all training steps completed and that the model finished 1 epoch, or 1 full pass through the dataset.
The metrics table reports the training loss, validation loss, and accuracy for this validation run. Because this walkthrough intentionally uses a very small dataset and a single epoch, the goal is to validate the workflow rather than optimize model quality.
Finally, output such as Uploaded files from ./model to s3://rhoai-test-s3-bucket/model/ confirms that the model artifacts were saved locally and uploaded successfully to Amazon S3.
To verify that, open the S3 bucket in the AWS console and confirm that the model/ prefix contains artifacts such as:
- config.json
- model.safetensors
- tokenizer_config.json
- tokenizer.json
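You can also list the prefix from the AWS CLI:

```shell
# List the uploaded model artifacts
aws s3 ls s3://rhoai-test-s3-bucket/model/
```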
That confirms the end-to-end workflow from OpenShift AI workbench to Amazon S3.