Deploying and Running Ollama and Open WebUI in a ROSA Cluster with GPUs
This content is authored by Red Hat experts, but has not yet been tested on every supported configuration.
Red Hat OpenShift Service on AWS (ROSA) provides a managed OpenShift environment that can leverage AWS GPU instances. This guide walks you through deploying Ollama and Open WebUI on ROSA using GPU-enabled instances for inference.
Prerequisites
- A Red Hat OpenShift Service on AWS (ROSA Classic or HCP) 4.14+ cluster
- The oc CLI, with admin access to the cluster
- The ROSA CLI
Set up GPU-enabled Machine Pool
First, check the availability of the instance type used here (g4dn.xlarge); it must be available in the same region as the cluster. Note that you can also use Graviton-based (ARM64) instances such as g5g*, but only on HCP 4.16+ clusters.
Using the following command, you can check for the availability of the g4dn.xlarge instance type in all eu-* regions:
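A sketch of such a check, assuming the AWS CLI is installed and configured with valid credentials:

```bash
# List the availability zones offering g4dn.xlarge in each eu-* region
for region in $(aws ec2 describe-regions \
    --query 'Regions[?starts_with(RegionName, `eu-`)].RegionName' --output text); do
  echo "--- ${region} ---"
  aws ec2 describe-instance-type-offerings \
    --region "${region}" \
    --location-type availability-zone \
    --filters Name=instance-type,Values=g4dn.xlarge \
    --query 'InstanceTypeOfferings[].Location' \
    --output text
done
```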
Example output:
Here we see that this instance type is available in three AZs in every eu-* region except eu-south-2 and eu-central-2.
With the region and zone known, use the following command to create a machine pool with GPU-enabled instances. In this example, the availability zone eu-central-1c is used:
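A sketch of the machine pool creation, assuming a cluster named my-cluster (replace it with your cluster name); the spot flags apply to ROSA Classic and may differ on HCP:

```bash
# Create a single-replica GPU machine pool using spot instances
rosa create machinepool \
  --cluster my-cluster \
  --name gpu \
  --replicas 1 \
  --instance-type g4dn.xlarge \
  --availability-zone eu-central-1c \
  --use-spot-instances
```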
This command creates a machine pool named “gpu” with one replica using the g4dn.xlarge spot instance, an x86_64 instance with an NVIDIA T4 16 GB GPU. It is currently the cheapest GPU instance available (around $0.2114/h as a spot instance), and 16 GB of VRAM is enough for running small and medium models.
Deploy Required Operators
We’ll use kustomize to deploy the necessary operators, using the GitOps catalog repository provided by the Red Hat Community of Practice (CoP).
Node Feature Discovery (NFD) Operator:
The NFD Operator detects hardware features and configuration in your cluster.
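For example, assuming the overlay layout of the Red Hat CoP GitOps catalog repository (verify the path against the repository before applying):

```bash
# Assumed overlay path in the redhat-cop/gitops-catalog repository
oc apply -k https://github.com/redhat-cop/gitops-catalog/nfd/operator/overlays/stable
```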
GPU Operator:
The GPU Operator manages the NVIDIA GPU drivers in your cluster.
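Similarly, a sketch using the same catalog (overlay path assumed):

```bash
# Assumed overlay path in the redhat-cop/gitops-catalog repository
oc apply -k https://github.com/redhat-cop/gitops-catalog/gpu-operator-certified/operator/overlays/stable
```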
Create Operator Instances
After the operators are installed, use the following commands to create their instances:
NFD Instance:
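A sketch, again assuming the catalog layout (verify the overlay name against the repository):

```bash
# Assumed overlay path; creates a default NodeFeatureDiscovery instance
oc apply -k https://github.com/redhat-cop/gitops-catalog/nfd/instance/overlays/default
```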
This creates an NFD instance for the cluster.
GPU Operator Instance:
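A sketch, assuming the catalog provides an AWS-specific overlay for the ClusterPolicy:

```bash
# Assumed overlay path; creates a GPU Operator ClusterPolicy tuned for AWS
oc apply -k https://github.com/redhat-cop/gitops-catalog/gpu-operator-certified/instance/overlays/aws
```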
This creates a GPU Operator instance configured for AWS.
Deploy Ollama and OpenWebUI
Next, use the following commands to deploy Ollama for model inference and OpenWebUI as the interface for interacting with the language model.
Create a new project:
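The namespace name llm matches the one referenced later in this guide:

```bash
oc new-project llm
```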
The following command deploys Ollama, sets up persistent storage, and allocates a GPU to the deployment:
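A minimal sketch of such a deployment, assuming the cluster's default (EBS-backed) storage class; the resource names, image tag, storage size, and model mount path are assumptions you may need to adapt:

```bash
oc apply -n llm -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
spec:
  accessModes: ["ReadWriteOnce"]   # EBS-backed (gp3) default storage class
  resources:
    requests:
      storage: 50Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  labels:
    app: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        env:
        - name: OLLAMA_HOST           # make Ollama listen on all interfaces
          value: "0.0.0.0"
        volumeMounts:
        - name: models
          mountPath: /.ollama         # assumed model path for an arbitrary UID
        resources:
          limits:
            nvidia.com/gpu: "1"       # schedules the pod on the GPU node
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: ollama-models
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
EOF
```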
The following command deploys OpenWebUI and sets up the necessary storage and environment variables and then expose the service with a route:
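A minimal sketch of the OpenWebUI deployment; the image tag, resource names, and storage size are assumptions, and OLLAMA_BASE_URL points at the Ollama service created above:

```bash
oc apply -n llm -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: open-webui-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui
  labels:
    app: open-webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
      - name: open-webui
        image: ghcr.io/open-webui/open-webui:main
        ports:
        - containerPort: 8080
        env:
        - name: OLLAMA_BASE_URL       # points the UI at the Ollama service
          value: "http://ollama:11434"
        volumeMounts:
        - name: data
          mountPath: /app/backend/data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: open-webui-data
---
apiVersion: v1
kind: Service
metadata:
  name: open-webui
spec:
  selector:
    app: open-webui
  ports:
  - port: 8080
    targetPort: 8080
EOF

# Expose the UI with a TLS edge-terminated route
oc create route edge open-webui --service=open-webui --port=8080 -n llm
```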
Verify deployment
Use the following command to ensure that all NVIDIA pods are either running or completed:
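Assuming the GPU Operator was installed into its default nvidia-gpu-operator namespace:

```bash
oc get pods -n nvidia-gpu-operator
```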
All pods in the llm namespace should be running:
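For example:

```bash
oc get pods -n llm
```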
Check the Ollama logs; it should detect the inference compute card:
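A sketch, assuming the deployment is named ollama as in the earlier manifest:

```bash
# Look for the GPU detection lines in the Ollama startup logs
oc logs deployment/ollama -n llm | grep -i -E "gpu|nvidia|cuda"
```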
Download a model
Download llama3.1 8B using the Ollama CLI:
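One way to do this is to run the Ollama CLI inside the running pod (deployment name assumed from the sketch above):

```bash
oc exec -n llm deployment/ollama -- ollama pull llama3.1:8b
```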
You can browse all available models at https://ollama.com/library.
Accessing OpenWebUI
After deploying OpenWebUI, follow these steps to access and configure it:
Get the route URL:
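For example, assuming the route is named open-webui as in the sketch above:

```bash
oc get route open-webui -n llm -o jsonpath='{.spec.host}'
```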
Open the URL in your web browser. You should see the OpenWebUI login page (see https://docs.openwebui.com/ for more details).
Initial Setup:
- The first time you access OpenWebUI, you’ll need to register.
- Choose a strong password for the admin account.
Configuring Models:
- Once logged in, go to the “Models” section to choose the LLMs you want to use.
Testing Your Setup:
- Create a new chat and select one of the models you’ve configured.
- Try sending a test prompt to ensure everything is working correctly.
Discover OpenWebUI! You get a lot of features, such as:
- Model Builder
- Local and Remote RAG Integration
- Web Browsing Capabilities
- Role-Based Access Control (RBAC)
You can read more about OpenWebUI here: https://docs.openwebui.com/features
Implement scaling
If you would like to provide the best experience for multiple users, for example to improve response time and tokens/s, you can scale the Ollama app.
Note that here you should use the EFS storage class (RWX access) instead of the EBS storage class (RWO access) for storing the Ollama models. For instructions on how to set this up, please see this tutorial.
Add a new GPU node to the machine pool:
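For example, scale the “gpu” machine pool to two replicas (replace my-cluster with your cluster name):

```bash
rosa edit machinepool gpu --cluster my-cluster --replicas 2
```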
Change the storage type of the Ollama app to use EFS:
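A sketch of this change, assuming an EFS storage class named efs-sc (created by following the tutorial above) and the volume and claim names from the earlier deployment sketch:

```bash
# Create an RWX PVC backed by the EFS storage class (name "efs-sc" assumed)
oc apply -n llm -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-efs
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: efs-sc
  resources:
    requests:
      storage: 50Gi
EOF

# Point the existing "models" volume of the deployment at the new claim;
# models stored on the old EBS volume will need to be pulled again.
oc set volume deployment/ollama -n llm --add --overwrite \
  --name=models --type=persistentVolumeClaim \
  --claim-name=ollama-models-efs --mount-path=/.ollama
```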
Scale the Ollama deployment:
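For example:

```bash
oc scale deployment/ollama -n llm --replicas=2
```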
Implement downscaling
For cost optimization, you can scale your GPU machine pool down to 0:
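For example (cluster name assumed):

```bash
rosa edit machinepool gpu --cluster my-cluster --replicas 0
```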
Uninstalling
Delete the llm namespace:
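For example:

```bash
oc delete project llm
```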
Delete the operators:
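A sketch, using the same assumed overlay paths as during installation (delete the instances first, then the operators):

```bash
oc delete -k https://github.com/redhat-cop/gitops-catalog/gpu-operator-certified/instance/overlays/aws
oc delete -k https://github.com/redhat-cop/gitops-catalog/nfd/instance/overlays/default
oc delete -k https://github.com/redhat-cop/gitops-catalog/gpu-operator-certified/operator/overlays/stable
oc delete -k https://github.com/redhat-cop/gitops-catalog/nfd/operator/overlays/stable
```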
Delete the machine pool:
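For example (cluster name assumed):

```bash
rosa delete machinepool gpu --cluster my-cluster
```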
Conclusion
- You now have Ollama and OpenWebUI deployed on your ROSA cluster, leveraging AWS GPU instances for inference.
- This setup allows you to run and interact with large language models efficiently using AWS’s GPU instances within a managed OpenShift environment.
- This approach represents the best of both worlds: the reliability and support of a managed OpenShift service and AWS, combined with the innovation and rapid advancement of the open-source AI community.
- It allows organizations to stay at the forefront of AI technology while maintaining the security, compliance, and operational standards required in enterprise environments.