
Building LLM Cost and Performance Dashboard with Red Hat OpenShift AI on ROSA and Amazon Bedrock

This content is authored by Red Hat experts, but has not yet been tested on every supported configuration.

1. Introduction

As LLM usage increases in the enterprise, it is easy to overlook that every LLM API call has two hidden costs: time and money. While data scientists might debate model accuracy, infrastructure engineers need to know whether that 2-second response time will scale and whether that $0.015 per thousand tokens will blow their quarterly budget. In this guide, we will build a simple cost and performance dashboard for Amazon Bedrock models using Red Hat OpenShift AI (RHOAI), our platform for managing the AI/ML project lifecycle, running on a Red Hat OpenShift Service on AWS (ROSA) cluster.

Here we will evaluate Claude 3.5 Sonnet v2 and Llama 3.3 70B, focusing on three critical metrics for production LLM deployment: response latency, token usage, and operational cost. By running identical prompts through both models, we can compare not only their responses but also their performance metrics, i.e. response time (and average response time), token usage, and total session cost. In addition, we can export the output to a CSV file for further analysis. The main objective is to gather empirical performance data to inform model selection decisions based on your use case requirements.

Disclaimers: Note that this guide references Amazon Bedrock pricing and model specifications that are subject to change. AWS regularly updates model versions, pricing tiers, and API rate limits. Always verify current pricing on the Amazon Bedrock site and check their latest API documentation before relying on cost calculations for budgeting decisions. In addition, the token estimation algorithm used in this tool provides approximations only—actual token counts can vary by 20-30% depending on content type, language, and model-specific tokenization methods. Performance metrics such as response times are influenced by factors including network latency, regional endpoint proximity, concurrent API load, and service availability, among others. It is your responsibility to validate all cost estimates and performance benchmarks against your actual AWS billing statements and production requirements. Neither the authors of this implementation nor the service providers can be held responsible for budget overruns, unexpected charges, or decisions made based on the estimated metrics provided by this tool. The cost calculations shown are for demonstration purposes only and should not be used as the sole basis for financial planning or infrastructure decisions. Lastly, please note that user interfaces, model offerings, and notebook environments may change over time as Red Hat OpenShift AI and Amazon Bedrock evolve. Some screenshots, code snippets, and model identifiers may not exactly match current versions.

2. Prerequisites

  1. A classic or HCP ROSA cluster

    We tested this on a ROSA HCP cluster with version 4.18.14 and m5.8xlarge instance size for the worker nodes.

  2. Amazon Bedrock

    You could use any model of your choice via Amazon Bedrock. In this guide, we’ll use Anthropic Claude 3.5 Sonnet v2 and Meta Llama 3.3 70B. Ensure that you enable the model (or the models of your choice) in your region and have the right permissions for Amazon Bedrock.

  3. RHOAI operator

  • You can install it using the console or the CLI, following Section 3 of the respective tutorial.
  • Once you have the operator installed, be sure to create a DataScienceCluster instance, wait a few minutes for the changes to take effect, and then launch the RHOAI dashboard for the next step.
  • We tested this tutorial using RHOAI version 2.19.0.

3. Assumption and calculation

3.1. Token and time measurement

Here we will be making several key assumptions. For starters, the token estimation uses a simplified heuristic of 1.3 tokens per word (based on the average English word-to-token ratio), which provides reasonable approximations without requiring model-specific tokenizers. Both models are configured with conservative parameters (temperature=0.1, top_p=0.9) to ensure consistent, near-deterministic outputs for the comparison.
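As a quick illustration of this heuristic (a rough approximation only, not the models' actual tokenizers), the estimate for a short prompt looks like this:

words = len("What is Red Hat OpenShift Service on AWS (ROSA)?".split())  # 9 words
estimated_tokens = words * 1.3  # about 11.7 estimated tokens
print(estimated_tokens)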

The timing measurements capture end-to-end response latency, including the network round trip from your ROSA cluster to Amazon Bedrock, model inference, and response parsing, so what you see is the user-experienced latency rather than isolated model performance. That said, the location of your ROSA cluster relative to the Bedrock endpoint (e.g. whether they are in the same region) may affect the results. In addition, we use a simple prompting strategy to improve response quality and reasoning clarity.
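To make that concrete, here is a minimal sketch of the measurement approach; timed_invoke is a hypothetical helper for illustration, while the full dashboard code in section 4 does the same thing inside its model-calling functions:

import json
import time
import boto3

bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-west-2')

def timed_invoke(model_id, request_body):
    # End-to-end latency: network round trip from the cluster + model inference + response parsing
    start = time.time()
    response = bedrock_runtime.invoke_model(modelId=model_id, body=json.dumps(request_body),
                                            contentType='application/json', accept='application/json')
    payload = json.loads(response['body'].read())
    return payload, round(time.time() - start, 2)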

Note that all of these values and parameters can be adjusted or tweaked to meet your specific requirements.

3.2. Cost calculator

The cost calculator implements a token-based pricing model similar to Amazon Bedrock’s billing structure, with separate rates for input and output tokens. For Claude 3.5 Sonnet v2, pricing is set at $0.003 per 1K input tokens and $0.015 per 1K output tokens, while Llama 3.3 70B uses a symmetric pricing model of $0.00072 per 1K tokens for both input and output. Note that we are using the pricing as of June 2025, at the time of writing; pricing may change, so please adjust these values accordingly. In essence, the calculator estimates token counts for both the input prompt and the generated response, then applies the model-specific pricing tiers to compute the total cost per request.
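For example, a hypothetical request with roughly 130 estimated input tokens and 390 estimated output tokens would cost the following at the June 2025 rates above:

# Hypothetical example: ~130 input tokens and ~390 output tokens (estimates, not billed counts)
input_tokens, output_tokens = 130, 390

claude_cost = (input_tokens / 1000) * 0.003 + (output_tokens / 1000) * 0.015     # 0.00039 + 0.00585
llama_cost = (input_tokens / 1000) * 0.00072 + (output_tokens / 1000) * 0.00072  # 0.0000936 + 0.0002808

print(f"Claude 3.5 Sonnet v2: ${claude_cost:.4f}, Llama 3.3 70B: ${llama_cost:.4f}")
# Claude 3.5 Sonnet v2: $0.0062, Llama 3.3 70B: $0.0004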

3.3. Performance metrics visualization

The metrics tracking system keeps a history of all model interactions during a session, storing response time, token usage, and cost data for each query. This data then feeds into a multi-panel matplotlib dashboard that visualizes four key insights: response time trends across queries (line plots), average response time comparison (bar charts), token usage patterns over time, and cumulative session costs.
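Conceptually, each query appends one record per model to an in-memory history; the sketch below (with made-up numbers) shows the data shape that the dashboard panels aggregate:

from collections import defaultdict

# One record per query and per model; the values here are illustrative only
metrics_history = defaultdict(list)
metrics_history['claude'].append({'time': 3.42, 'tokens': 312, 'cost': 0.0051})
metrics_history['llama'].append({'time': 5.87, 'tokens': 298, 'cost': 0.0003})

# Example aggregations behind the dashboard panels
avg_claude_time = sum(m['time'] for m in metrics_history['claude']) / len(metrics_history['claude'])
total_session_cost = sum(m['cost'] for m in metrics_history['claude'] + metrics_history['llama'])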

4. Dashboard code and output

Now that we understand the assumptions and calculations used in this guide, let’s create the dashboard in a Jupyter notebook. Before proceeding, be sure you have your AWS credentials handy, i.e. the AWS Access Key ID and AWS Secret Access Key, to access Amazon Bedrock.

Next, on the RHOAI dashboard, launch a Jupyter notebook instance. In this example, we will be using the TensorFlow 2025.1 image with the Medium container size for the notebook. This might take a few minutes to provision.
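Optionally, you can first run a quick sanity check to confirm that the models you enabled are visible to your account. This is a minimal sketch that assumes your AWS credentials are already available to boto3 (for example, after setting the environment variables shown in the cell below):

import boto3

# List Bedrock foundation models visible in your region and filter for the two used in this guide
bedrock = boto3.client('bedrock', region_name='us-west-2')
models = bedrock.list_foundation_models()['modelSummaries']
print([m['modelId'] for m in models if 'claude-3-5' in m['modelId'] or 'llama3-3' in m['modelId']])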

Copy the code below into one cell, replace the placeholder environment variables with your AWS credentials, and then run it.

!pip install boto3 ipywidgets matplotlib pandas

import boto3
import json
import os
import time
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display, HTML, clear_output
import ipywidgets as widgets
from datetime import datetime
from collections import defaultdict

os.environ['AWS_ACCESS_KEY_ID'] = 'YOUR-AWS-ACCESS-KEY-ID'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'YOUR-AWS-SECRET-ACCESS-KEY'
os.environ['AWS_DEFAULT_REGION'] = 'us-west-2'

# Bedrock runtime client, model identifiers, and per-1K-token pricing (USD)
bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-west-2')
CLAUDE_MODEL = "anthropic.claude-3-5-sonnet-20241022-v2:0"
LLAMA_MODEL = "us.meta.llama3-3-70b-instruct-v1:0"
PRICING = {
    CLAUDE_MODEL: {"input": 0.003, "output": 0.015},
    LLAMA_MODEL: {"input": 0.00072, "output": 0.00072}
}
metrics_history = defaultdict(list)

# Rough token estimate: ~1.3 tokens per English word (approximation only)
def estimate_tokens(text):
    return len(text.split()) * 1.3

# Estimate input/output tokens and apply per-model pricing to get the cost per request
def calculate_cost(model, input_text, output_text):
    input_tokens = estimate_tokens(input_text)
    output_tokens = estimate_tokens(output_text)
    input_cost = (input_tokens / 1000) * PRICING[model]["input"]
    output_cost = (output_tokens / 1000) * PRICING[model]["output"]
    return {"input_tokens": int(input_tokens), "output_tokens": int(output_tokens), "total_cost": round(input_cost + output_cost, 4)}

# Invoke Claude 3.5 Sonnet v2 via Bedrock and record latency, tokens, and cost
def call_claude_model(prompt):
    start_time = time.time()
    enhanced_prompt = f"You are a helpful and accurate assistant. Think step by step and provide clear reasoning.\n\n{prompt}"
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "temperature": 0.1,
        "top_p": 0.9,
        "top_k": 250,
        "messages": [{"role": "user", "content": enhanced_prompt}]
    })
    response = bedrock_runtime.invoke_model(modelId=CLAUDE_MODEL, contentType='application/json', accept='application/json', body=body)
    response_body = json.loads(response['body'].read())
    response_text = response_body['content'][0]['text']
    elapsed_time = round(time.time() - start_time, 2)
    cost_info = calculate_cost(CLAUDE_MODEL, prompt, response_text)
    metrics_history['claude'].append({'time': elapsed_time, 'tokens': cost_info['output_tokens'], 'cost': cost_info['total_cost']})
    return response_text, elapsed_time, cost_info

# Invoke Llama 3.3 70B via Bedrock (using its chat template) and record latency, tokens, and cost
def call_llama_model(prompt):
    start_time = time.time()
    system_prompt = "You are a helpful and accurate assistant. Think step by step and provide clear reasoning."
    formatted_prompt = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
    body = json.dumps({"prompt": formatted_prompt, "max_gen_len": 1000, "temperature": 0.1, "top_p": 0.9})
    response = bedrock_runtime.invoke_model(modelId=LLAMA_MODEL, contentType='application/json', accept='application/json', body=body)
    response_body = json.loads(response['body'].read())
    response_text = response_body['generation']
    elapsed_time = round(time.time() - start_time, 2)
    cost_info = calculate_cost(LLAMA_MODEL, prompt, response_text)
    metrics_history['llama'].append({'time': elapsed_time, 'tokens': cost_info['output_tokens'], 'cost': cost_info['total_cost']})
    return response_text, elapsed_time, cost_info

# Build the side-by-side HTML comparison with a performance and cost summary
def create_comparison_display(question, claude_data, llama_data):
    claude_response, claude_time, claude_cost = claude_data
    llama_response, llama_time, llama_cost = llama_data
    time_diff = abs(claude_time - llama_time)
    faster_model = "Claude 3.5" if claude_time < llama_time else "Llama 3.3"
    cost_diff = abs(claude_cost['total_cost'] - llama_cost['total_cost'])
    cheaper_model = "Claude 3.5" if claude_cost['total_cost'] < llama_cost['total_cost'] else "Llama 3.3"
    return f"""
    <div style="margin: 20px 0;">
        <h3>Question: {question}</h3>
        <div style="background-color: #f0f4f8; padding: 15px; border-radius: 8px; margin: 15px 0;">
            <h4>Performance Summary</h4>
            <p><strong>{faster_model}</strong> responded <strong>{time_diff:.2f}s faster</strong> | <strong>{cheaper_model}</strong> was <strong>${cost_diff:.4f} cheaper</strong></p>
            <div style="display: grid; grid-template-columns: repeat(4, 1fr); gap: 10px;">
                <div>Claude Time: {claude_time}s</div>
                <div>Claude Cost: ${claude_cost['total_cost']}</div>
                <div>Llama Time: {llama_time}s</div>
                <div>Llama Cost: ${llama_cost['total_cost']}</div>
            </div>
        </div>
        <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 20px;">
            <div><h4>Claude 3.5 Sonnet v2</h4><p>{claude_response}</p></div>
            <div><h4>Llama 3.3 70B</h4><p>{llama_response}</p></div>
        </div>
    </div>
    """

# Plot a 2x2 dashboard: response times, average times, token usage, and total session cost
def show_metrics_dashboard():
    if not metrics_history['claude'] and not metrics_history['llama']:
        return HTML("<p>No metrics available yet. Run some comparisons first!</p>")
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(14, 10))
    fig.suptitle('Session Performance Metrics', fontsize=16)
    if metrics_history['claude']:
        claude_times = [m['time'] for m in metrics_history['claude']]
        ax1.plot(claude_times, 'o-', label='Claude 3.5', color='#D97706')
    if metrics_history['llama']:
        llama_times = [m['time'] for m in metrics_history['llama']]
        ax1.plot(llama_times, 'o-', label='Llama 3.3', color='#7C3AED')
    ax1.set_title('Response Time')
    ax1.set_xlabel('Query #')
    ax1.set_ylabel('Time (seconds)')
    ax1.legend()
    avg_times = []
    labels = []
    if metrics_history['claude']:
        avg_times.append(sum(m['time'] for m in metrics_history['claude']) / len(metrics_history['claude']))
        labels.append('Claude 3.5')
    if metrics_history['llama']:
        avg_times.append(sum(m['time'] for m in metrics_history['llama']) / len(metrics_history['llama']))
        labels.append('Llama 3.3')
    if avg_times:
        ax2.bar(labels, avg_times, color=['#D97706', '#7C3AED'][:len(labels)])
        ax2.set_title('Average Response Time')
        ax2.set_ylabel('Time (seconds)')
    if metrics_history['claude']:
        claude_tokens = [m['tokens'] for m in metrics_history['claude']]
        ax3.plot(claude_tokens, 'o-', label='Claude 3.5', color='#D97706')
    if metrics_history['llama']:
        llama_tokens = [m['tokens'] for m in metrics_history['llama']]
        ax3.plot(llama_tokens, 'o-', label='Llama 3.3', color='#7C3AED')
    ax3.set_title('Token Usage')
    ax3.set_xlabel('Query #')
    ax3.set_ylabel('Output Tokens')
    ax3.legend()
    total_costs = []
    if metrics_history['claude']:
        total_costs.append(sum(m['cost'] for m in metrics_history['claude']))
    if metrics_history['llama']:
        total_costs.append(sum(m['cost'] for m in metrics_history['llama']))
    if total_costs:
        ax4.bar(labels, total_costs, color=['#D97706', '#7C3AED'][:len(labels)])
        ax4.set_title('Total Session Cost')
        ax4.set_ylabel('Cost ($)')
    plt.tight_layout()
    plt.show()
    total_queries = max(len(metrics_history['claude']), len(metrics_history['llama']))
    total_cost = sum(total_costs) if total_costs else 0
    return HTML(f"<div><h4>Session Summary</h4><p>Total Queries: {total_queries}</p><p>Total Cost: ${total_cost:.4f}</p></div>")

# Export per-query metrics for both models to a timestamped CSV file
def export_comparison_history():
    if not metrics_history['claude'] and not metrics_history['llama']:
        return "No data to export"
    data = []
    max_len = max(len(metrics_history['claude']), len(metrics_history['llama']))
    for i in range(max_len):
        row = {'query_number': i + 1}
        if i < len(metrics_history['claude']):
            claude_metric = metrics_history['claude'][i]
            row.update({'claude_time': claude_metric['time'], 'claude_tokens': claude_metric['tokens'], 'claude_cost': claude_metric['cost']})
        if i < len(metrics_history['llama']):
            llama_metric = metrics_history['llama'][i]
            row.update({'llama_time': llama_metric['time'], 'llama_tokens': llama_metric['tokens'], 'llama_cost': llama_metric['cost']})
        data.append(row)
    df = pd.DataFrame(data)
    filename = f"llm_comparison_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
    df.to_csv(filename, index=False)
    return f"Exported to {filename}"

# Notebook UI: question input, action buttons, and output area
question_input = widgets.Textarea(placeholder='Enter your question here...', description='Question:', layout=widgets.Layout(width='100%', height='80px'))
submit_button = widgets.Button(description='Compare Models', button_style='primary', icon='rocket')
metrics_button = widgets.Button(description='Show Metrics', button_style='info', icon='bar-chart')
export_button = widgets.Button(description='Export Data', button_style='success', icon='download')
output_area = widgets.Output()

sample_questions = widgets.Dropdown(options=[
    'Select a sample question...',
    'What is OpenShift and how does it compare to Kubernetes?',
    'What is Red Hat OpenShift Service on AWS (ROSA)?',
    'What is Azure Red Hat OpenShift (ARO)?',
    'What is the difference between Red Hat OpenShift AI and Amazon Bedrock?'
], description='Samples:', layout=widgets.Layout(width='100%'))

def on_sample_change(change):
    if change['new'] != 'Select a sample question...':
        question_input.value = change['new']

sample_questions.observe(on_sample_change, names='value')

# Button handlers: run both models, show the metrics dashboard, and export data
def on_submit_clicked(b):
    with output_area:
        clear_output()
        if not question_input.value.strip():
            display(HTML("<p style='color: red;'>Please enter a question!</p>"))
            return
        display(HTML("<p><i>Getting responses...</i></p>"))
        question = question_input.value.strip()
        claude_data = call_claude_model(question)
        llama_data = call_llama_model(question)
        clear_output()
        display(HTML(create_comparison_display(question, (claude_data[0].replace('\n', '<br>'), claude_data[1], claude_data[2]), (llama_data[0].replace('\n', '<br>'), llama_data[1], llama_data[2]))))

def on_metrics_clicked(b):
    with output_area:
        clear_output()
        display(show_metrics_dashboard())

def on_export_clicked(b):
    with output_area:
        result = export_comparison_history()
        display(HTML(f"<p>{result}</p>"))

submit_button.on_click(on_submit_clicked)
metrics_button.on_click(on_metrics_clicked)
export_button.on_click(on_export_clicked)

display(HTML("<h1>LLM Cost & Performance Dashboard</h1><p>Compare Claude 3.5 Sonnet v2 vs Llama 3.3 70B</p>"))
display(sample_questions)
display(question_input)
display(widgets.HBox([submit_button, metrics_button, export_button]))
display(output_area)

In the query section, either select one of the sample questions or write a new one. For example, here is my question:


[Screenshot: sample question entered in the query field]

And then hit the blue Compare Models button, and you’ll get something like below (note that results may vary):


[Screenshot: side-by-side model responses with the performance summary]

Next, click the turquoise Show Metrics button, and you’ll get metrics like below (note that results may vary):


[Screenshot: session performance metrics dashboard]

And finally, hit the green Export Data button to export the data to a CSV file like the one below (note that results may vary):


[Screenshot: exported CSV file]

5. Future research

There are many ways to improve on this guide. First, instead of hardcoding AWS credentials, consider a safer alternative such as IAM Roles for Service Accounts (IRSA). You could also add a control panel widget at the top of the notebook that lets users choose the AWS region and model versions and adjust the temperature/max_tokens parameters without modifying the code, making the tool more accessible to non-technical users. In addition, you could add save (and load) functionality to preserve the comparison history across notebook restarts, so users can resume previous analysis sessions.
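As one possible starting point for that last idea, here is a minimal sketch (assuming the metrics_history structure from the dashboard code above, with a hypothetical metrics_history.json filename) that persists the session history to disk and reloads it later:

import json
from collections import defaultdict

HISTORY_FILE = "metrics_history.json"  # hypothetical filename

def save_history(metrics_history, path=HISTORY_FILE):
    # defaultdict(list) serializes cleanly as a plain dict of lists
    with open(path, "w") as f:
        json.dump(dict(metrics_history), f)

def load_history(path=HISTORY_FILE):
    # Returns an empty history if no previous session has been saved
    try:
        with open(path) as f:
            return defaultdict(list, json.load(f))
    except FileNotFoundError:
        return defaultdict(list)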

