How to run and deploy LLMs using Red Hat OpenShift AI on a Red Hat OpenShift Service on AWS cluster

Learn how to install the Red Hat® OpenShift® AI (RHOAI) operator and Jupyter notebook, create an Amazon S3 bucket, and run the LLM model on a Red Hat OpenShift Service on AWS (ROSA) cluster.

Disclaimer: this content is authored by Red Hat experts, but has not yet been tested on every supported configuration.

Training the LLM model

15 mins

Now that you have installed the notebook, configured the AWS CLI, and created an Amazon S3 bucket, let's run your model on the notebook.

In this resource, we will use the Hugging Face Transformers library to fine-tune a pre-trained model, prajjwal1/bert-tiny, on a small subset of the AG News dataset for text classification.

Note: Hugging Face Transformers is an open-source library providing a wide range of pre-trained models and tools for natural language processing tasks. AG News is a dataset of news articles from various sources, commonly used for text classification tasks. prajjwal1/bert-tiny is a very small version of BERT, a transformer-based model pre-trained on a large corpus of text data.
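
If you'd like a feel for the data before training, a minimal peek at AG News looks like the sketch below. It assumes the datasets library (installed by the pip step later in this resource) is already available:

# load the AG News training split and inspect one example
from datasets import load_dataset

dataset = load_dataset("ag_news")
print(dataset["train"][0]["text"][:100])         # first 100 characters of one article
print(dataset["train"].features["label"].names)  # ['World', 'Sports', 'Business', 'Sci/Tech']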

What will you learn?

  • How to train your LLM model using the Hugging Face Transformers library

What do you need before starting?

  • Met all prerequisites
  • Completed previous steps

How to train the LLM model

Now, we’ll cover the code you’ll need for the notebook. NOTE: remember to change the bucket name to your own bucket’s name. Instructions and descriptions are included inline, denoted by the “#” character, so you can copy the entire code block as-is:

# install the necessary libraries
!pip install transformers datasets torch evaluate accelerate boto3

# import the necessary functions and APIs
import numpy as np
import evaluate
import boto3
import os
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

# disable tokenizers parallelism warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# load a portion of the AG News dataset (500 examples)
dataset = load_dataset("ag_news")
small_dataset = dataset["train"].shuffle(seed=42).select(range(500))  

# load the tokenizer and the pre-trained model (prajjwal1/bert-tiny) with a 4-label classification head
model_name = "prajjwal1/bert-tiny"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)

# define the function to tokenize text examples using the loaded tokenizer
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
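# (the tokenizer returns input_ids, token_type_ids, and attention_mask for each
# example, padded or truncated to the model's 512-token maximum length)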

# apply the tokenize_function to small_dataset using the map method
tokenized_datasets = small_dataset.map(tokenize_function, batched=True)

# specify the training arguments: output directory, evaluation and save strategies, learning rate, batch sizes, number of epochs, weight decay, and loading the best model at the end
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,  
    per_device_eval_batch_size=8,
    num_train_epochs=3,  
    weight_decay=0.01,
    load_best_model_at_end=True,
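    # note: load_best_model_at_end requires eval_strategy and save_strategy to match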
)

# load the accuracy metric from the evaluate library
metric = evaluate.load("accuracy")

# compute evaluation metrics: take the eval predictions (logits and labels) and calculate accuracy using the loaded metric
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# set up the training process by taking the model, training arguments, train and eval datasets, tokenizer and the compute_metrics function
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets,  
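    # note: for simplicity, the same 500 examples serve as both the train and
    # eval sets here; a held-out validation split would give a more honest score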
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# start the training process using the configured trainer
trainer.train()

# save the model and tokenizer into the model folder
model_save_dir = "./model"
tokenizer.save_pretrained(model_save_dir)
model.save_pretrained(model_save_dir)

# upload the saved model to the S3 bucket
s3_client = boto3.client('s3')
bucket_name = 'llm-bucket-dsari' # change this to your own bucket name
model_save_path = 'model/'

for file_name in os.listdir(model_save_dir):
    s3_client.upload_file(
        os.path.join(model_save_dir, file_name),
        bucket_name,
        model_save_path + file_name
    )

In summary, the code loads the dataset, tokenizes the text examples, sets up the training arguments, defines the evaluation metrics, and trains the model using the Trainer class. Finally, it saves the trained model and tokenizer locally and then uploads and saves them to the S3 bucket. 
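
Once training finishes, you can optionally sanity-check the fine-tuned model with a single prediction before moving on. This is a minimal sketch, not part of the original walkthrough; the example headline is arbitrary, and the label order follows the AG News label names (World, Sports, Business, Sci/Tech):

# optional sanity check: classify one example headline with the fine-tuned model
import torch

model.eval()  # disable dropout for inference
text = "Stocks rallied after the quarterly earnings report."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
    logits = model(**inputs).logits
label_names = ["World", "Sports", "Business", "Sci/Tech"]
print(label_names[logits.argmax(dim=-1).item()])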

After you run the full training block, you should see output similar to the following (exact numbers may vary):

Screenshot of training output
Example of training output showing incremental improvements in accuracy with each epoch.
Here, the increasing accuracy and decreasing loss suggest that the model is learning and improving across epochs. However, the final accuracy of only 45.8% is low, indicating that the model's performance is suboptimal.
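
If the training log has already scrolled past, you can reprint the final evaluation metrics directly from the trainer. This is an optional check that re-runs evaluation on the eval dataset configured above:

# re-run evaluation and print the final metrics
metrics = trainer.evaluate()
print(metrics)  # includes eval_loss and eval_accuracy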

This low accuracy is understandable, because the model is trained on a very small subset of the dataset (500 examples), and we're also using a very small version of the BERT model (prajjwal1/bert-tiny). With this in mind, you might want to try a larger dataset and a larger model in your experiments. You could also fine-tune the hyperparameters to better suit the training process; we will cover this optional step in the next resource.
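
If you do want to experiment right away, scaling up is mostly a two-line change. This is a sketch rather than a tested configuration: bert-base-uncased and the 5,000-example slice are arbitrary choices, and training will take noticeably longer on CPU:

# hypothetical scale-up: a larger data slice and a larger pre-trained model
small_dataset = dataset["train"].shuffle(seed=42).select(range(5000))
model_name = "bert-base-uncased"  # ~110M parameters vs. ~4.4M for bert-tiny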

Possible error notes

You may see the following errors during the above steps. Many of them are expected in the scenario we’re using for this learning path, and are not cause for concern:

  • Unable to register cuDNN/cuFFT/cuBLAS factory...: These errors are informational and generally harmless. They indicate that multiple components are trying to initialize the same CUDA libraries, but it shouldn't affect the training process.
  • This TensorFlow binary is optimized to use available CPU instructions...: This is a warning from TensorFlow indicating that the binary was not compiled to use certain CPU instructions (AVX2, AVX512F, FMA) that your processor may support. Since we're not chasing maximum CPU performance here, this warning is expected.
  • TF-TRT Warning: Could not find TensorRT: TensorRT is NVIDIA's library for optimizing deep learning models. This warning just means that it's not available, which is fine since we're not using it.
  • Some weights of BertForSequenceClassification were not initialized...: This is a standard message when you're fine-tuning a model. It indicates that some parts of the model will be trained from scratch to adapt to your specific task, i.e. text classification on AG News.

Remember to save the notebook

Last but not least, do not forget to save your notebook. In the file browser on the left, you will see the model folder where the results, i.e. the model and tokenizer, were saved. You can also see the results folder and, within it, the runs folder for every run you make. In addition, if you go to the S3 bucket in the console, you will see the output stored in the model folder:

Screenshot of the model folder.
Stored models in the Amazon S3 bucket.
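
You can also confirm the upload from the notebook itself with boto3, reusing the client, bucket name, and prefix defined earlier (a quick optional check):

# list the uploaded model files in the S3 bucket
response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=model_save_path)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])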

With that, you’re ready to move on to the next resource.

Previous resource: Creating S3 bucket
Next resource: Hyperparameter tuning

This learning path is for operations teams or system administrators.

Developers might want to check out how to create a natural language processing (NLP) application using Red Hat OpenShift AI on developers.redhat.com.
