Creating a RAG Chatbot using TinyLlama and LangChain with Red Hat OpenShift AI on ARO
This content is authored by Red Hat experts, but has not yet been tested on every supported configuration.
1. Introduction
Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by retrieving relevant information from a knowledge base before generating responses, rather than relying solely on their training data. LangChain is a framework for developing applications powered by language models. It provides tools and APIs that make it easier to build complex applications on top of LLMs, such as applying the RAG technique to enable a chatbot to answer questions based on a provided document.
This tutorial is a simple guide on how to create a RAG chatbot that can provide relevant responses to questions about ARO, based on the official ARO documentation, which consists of 421 PDF pages at the time of writing. We will be using Red Hat OpenShift AI (RHOAI), formerly Red Hat OpenShift Data Science (RHODS), an OpenShift platform for managing AI/ML projects, and we will run this on an Azure Red Hat OpenShift (ARO) cluster, our managed OpenShift service on Azure.
Here we will create a chatbot using the TinyLlama model, and we will use several key components from LangChain to help build it: document loading (PyPDFLoader), text splitting (RecursiveCharacterTextSplitter), vector store (FAISS), retrieval chain (RetrievalQA), and prompt templates (PromptTemplate). At the end of this tutorial, there is an optional section that creates another RAG system using the GPT-4 model via Azure OpenAI Service, so that we can compare the responses from both systems.
Disclaimer: When interacting with chatbots and AI language models, please be aware that while these systems include content filters and safety features, they are not infallible. It is your responsibility to use these tools appropriately and ensure your interactions are suitable. Neither the author of this tutorial nor the service providers can be held responsible for any unexpected or inappropriate responses. By proceeding, you acknowledge that AI responses can be unpredictable and may occasionally contain inaccuracies or deviate from the expected behavior. It is important to review and verify any critical information or advice received from these systems before acting upon it. In addition, please note that user interfaces may change over time as the products evolve. Some screenshots and instructions may not exactly match what you see.
2. Prerequisites
- An ARO cluster (>= version 4.15)
- You can deploy it manually or by using Terraform.
- I tested this using ARO version 4.15.27 with the Standard_D16s_v3 instance size for both the control plane and the worker nodes.
- RHOAI operator
- You can install it using the console per Section 3 of this tutorial or using the CLI per Section 3 of this tutorial.
- I tested this tutorial using RHOAI version 2.13.1.
3. Creating the RAG Chatbot
Once we have the RHOAI operator installed and the DataScienceCluster instance created, please proceed to the RHOAI dashboard and launch a Jupyter notebook instance. In this case, I’m using the TensorFlow 2024.1 image with the Medium container size for the notebook.
Next, we will install the required packages and import the necessary libraries for the RAG system, and then we are going to do the following:
Step 1 – PDF Processing and Chunking
Here we will download the ARO documentation and break it into smaller “chunks” of text. Chunking is a technique where large documents are split into smaller, manageable pieces. It is a crucial step since language models have token limits and work better with smaller, focused pieces of text.
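As a rough illustration of this step, a minimal sketch might look like the following; note that the PDF URL, chunk size, and overlap values are illustrative assumptions, and import paths can vary slightly across LangChain versions:

```python
import urllib.request

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Download the ARO documentation PDF (hypothetical URL; use the official docs PDF)
pdf_url = "https://example.com/aro-documentation.pdf"
urllib.request.urlretrieve(pdf_url, "aro-docs.pdf")

# Load every page of the PDF as a LangChain Document
documents = PyPDFLoader("aro-docs.pdf").load()

# Split pages into overlapping chunks that fit comfortably within model token limits
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)
print(f"Split {len(documents)} pages into {len(chunks)} chunks")
```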
Step 2 – Vector Store Creation
FAISS (Facebook AI Similarity Search) is a library that efficiently stores and searches text embeddings, which are numerical representations of text that capture semantic meaning. Here we convert each text chunk into an embedding using a MiniLM model, and these embeddings are then stored in FAISS, which allows for quick similarity searches when answering questions.
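A minimal sketch of this step, assuming the widely used all-MiniLM-L6-v2 sentence-transformer as the MiniLM model:

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Embed each chunk with a small MiniLM sentence-transformer (runs fine on CPU)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Index the embeddings in FAISS for fast similarity search
vector_store = FAISS.from_documents(chunks, embeddings)

# Sanity check: fetch the three chunks most similar to a sample query
results = vector_store.similarity_search("What is Azure Red Hat OpenShift?", k=3)
```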
Step 3 – Language Model Setup
Here we set up TinyLlama as the primary language model and GPT-2 as the fallback. TinyLlama is an open-source small language model that is specifically trained for chat/instruction following and can handle context and generate coherent responses while remaining lightweight; it is a smaller but efficient model. GPT-2, serving as the fallback, is an older but reliable model from OpenAI that runs on the CPU.
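A minimal sketch of this step; the model IDs and generation settings below are common defaults and may differ from the notebook’s exact values:

```python
from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline

try:
    # Primary model: the TinyLlama chat model, explicitly kept on the CPU
    generator = pipeline(
        "text-generation",
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        device_map="cpu",
        max_new_tokens=256,
    )
except Exception:
    # Fallback model: the much smaller GPT-2, which also runs on the CPU
    generator = pipeline("text-generation", model="gpt2", max_new_tokens=256)

# Wrap the transformers pipeline so LangChain chains can call it as an LLM
llm = HuggingFacePipeline(pipeline=generator)
```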
Step 4 – Question Classification
Next, we will implement prompt chaining by first categorizing questions into certain types (e.g. benefits, technical) using regex patterns. Based on the type, a specific prompt template is chosen. The relevant documents are then retrieved, and both the context and the question are combined into a prompt, which is then processed by the LLM.
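A minimal sketch of the classification logic; the categories, regex patterns, and template wording are illustrative assumptions:

```python
import re

from langchain.prompts import PromptTemplate

# Hypothetical category patterns; extend these to match your question types
QUESTION_PATTERNS = {
    "benefits": re.compile(r"\b(benefit|advantage|why (use|choose))\b", re.IGNORECASE),
    "technical": re.compile(r"\b(how|configure|install|deploy|troubleshoot)\b", re.IGNORECASE),
}

TEMPLATES = {
    "benefits": "Using the context, list the key benefits.\n\nContext: {context}\n\nQuestion: {question}\n\nAnswer:",
    "technical": "Using the context, give step-by-step guidance.\n\nContext: {context}\n\nQuestion: {question}\n\nAnswer:",
    "general": "Using the context, answer concisely.\n\nContext: {context}\n\nQuestion: {question}\n\nAnswer:",
}

def classify_question(question: str) -> str:
    """Return the first category whose pattern matches, else 'general'."""
    for category, pattern in QUESTION_PATTERNS.items():
        if pattern.search(question):
            return category
    return "general"

def build_prompt(question: str) -> PromptTemplate:
    """Pick the prompt template that matches the question's category."""
    template = TEMPLATES[classify_question(question)]
    return PromptTemplate(template=template, input_variables=["context", "question"])
```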
Step 5 – Response Formatting
Here we are going to format the response with proper HTML styling and error handling.
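One possible sketch of such a formatter, assuming the QA chain returns a dict with a result key (as LangChain’s RetrievalQA does):

```python
from IPython.display import HTML

def format_response(result: dict) -> HTML:
    """Render the QA chain output as lightly styled HTML, with basic error handling."""
    try:
        answer = result.get("result", "No answer was generated.")
    except AttributeError:
        answer = str(result)
    return HTML(
        "<div style='border:1px solid #ccc; border-radius:5px; padding:10px;'>"
        f"<b>Answer:</b> {answer}</div>"
    )
```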
Step 6 – User Interface (UI) Creation
In this step, we will create an interactive UI using IPython widgets for question input and response display.
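A minimal ipywidgets sketch; it assumes the qa_chain object created in Step 7 and the format_response helper from Step 5:

```python
import ipywidgets as widgets
from IPython.display import display

question_box = widgets.Text(placeholder="Ask a question about ARO...")
ask_button = widgets.Button(description="Ask")
output_area = widgets.Output()

def on_ask_clicked(_button):
    # Run the question through the QA chain and render the answer below the button
    with output_area:
        output_area.clear_output()
        result = qa_chain.invoke({"query": question_box.value})  # qa_chain from Step 7
        display(format_response(result))

ask_button.on_click(on_ask_clicked)
display(widgets.VBox([question_box, ask_button, output_area]))
```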
Step 7 – System Initialization
Lastly, we will initialize the complete RAG system by combining all components (vector store, language model, and question-answering chain) and launch the interface.
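Tying it together might look like the sketch below; in practice you would rebuild the chain (or at least the prompt) per question, since the template depends on the question’s category:

```python
from langchain.chains import RetrievalQA
from IPython.display import display

question = "what is aro?"

# Combine the retriever, LLM, and category-specific prompt into a RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # stuff all retrieved chunks into one prompt
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
    chain_type_kwargs={"prompt": build_prompt(question)},
)

result = qa_chain.invoke({"query": question})
display(format_response(result))
```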
On the Jupyter notebook, copy the code below into one cell:
Run the cell and, once it completes, you can ask your ARO-related questions and see how well the model responds. Please note that it might take a few minutes to respond.
This is what I got when I typed in the question “what is aro?” (please note that the outcomes may vary):

On your first run, you might see CUDA/TensorFlow warnings as the system detects some CUDA (NVIDIA GPU) related components that are already registered. However, the code explicitly uses the CPU (device_map="cpu"), so these warnings will not affect functionality.
4. Future research
Note that this is a simple tutorial on creating a RAG chatbot that uses a generic model yet is able to provide answers based on particular documentation, in this case the ARO product documentation. We are using LangChain APIs in the code; if you’re interested in reading more about them, take a look at LangChain’s official documentation here and also this fantastic blog here.
Lastly, there are many ways to improve this RAG system. You could, for example, use a more robust or reliable model, such as GPT-4 or GPT-3.5, to improve accuracy. Additionally, you could integrate the Q&A chat with Slack or other chat platforms to make it more accessible to your end users.
Bonus section: Comparing responses with Azure OpenAI’s GPT-4 model
This is an optional section so feel free to skip it.
As mentioned previously, one way to improve the accuracy of the responses is to use a more reliable or advanced model. For instance, we can leverage the Azure OpenAI Service, which allows us to utilize OpenAI’s high-end models. In this section, we will create a comparison system that lets you compare the responses from the TinyLlama-based system with those from a more advanced model, OpenAI’s GPT-4.
To enable this system comparison, we first need to create an Azure OpenAI resource. You can do so by going to the Azure portal and searching for “Azure OpenAI”; once you click the icon/link, you will be redirected to the Azure AI services page like the one shown below. Next, click the Create Azure OpenAI button in the center of the page, which leads to another page where you can create and customize your Azure OpenAI instance.

On the Create Azure OpenAI page (not displayed here), select the resource group where your ARO cluster resides and the same region as the resource group. Then name your instance and select the pricing tier that suits your needs. In my case, I named it openai-rag-aro-v0 and chose Standard S0 as the pricing tier. On the next page, I left the network selection at its default, which allows internet access to the resource. Then, click the Submit button once you have reviewed the configuration. Once your deployment is complete, click the Go to resource button.
Then, on the next page as shown below, click the Explore Azure AI Foundry portal button (or click the Go to Azure AI Foundry portal link on the upper left).

Once you are redirected to the Azure AI Foundry portal page as shown below, click the Model catalog tab in the sidebar navigation on the left.

On the Model catalog page shown below, click gpt-4 chat completion in the model catalog.

You will then be redirected to the gpt-4 model deployment page, where you can name your model deployment and select the deployment type that meets your needs (not displayed). In my case, I left the name at its default, gpt-4, and selected Global Standard as the deployment type. Lastly, click the Deploy button to start the deployment.

Once the model is deployed, you will see the deployment’s details, including its endpoint’s target URI and keys (not displayed). We will use the latter two, so please keep them handy.
Next, we will install the necessary packages, import the required libraries, and then create an enhanced RAG system with the following steps, in summary:
Step 1 – Azure OpenAI Integration
Here we create a chatbot system using the Azure OpenAI service, in this case using the gpt-4 deployment that we just created.
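A minimal sketch of the integration; the azure_openai_config dict is a hypothetical stand-in for your deployment’s target URI and key from the previous section, and the API version shown is only an example:

```python
from langchain_openai import AzureChatOpenAI

# Hypothetical config; fill in the target URI and key from your deployment
azure_openai_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com/",
    "api_key": "<your-api-key>",
    "api_version": "2024-02-01",  # example API version
    "deployment_name": "gpt-4",
}

azure_llm = AzureChatOpenAI(
    azure_endpoint=azure_openai_config["azure_endpoint"],
    api_key=azure_openai_config["api_key"],
    api_version=azure_openai_config["api_version"],
    azure_deployment=azure_openai_config["deployment_name"],
    temperature=0,
)
```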
Step 2 – Comparison System Creation
Next, we will create a comparison system that allows us to get responses from both chatbot systems.
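A minimal sketch, reusing the vector_store and qa_chain built in the first part of the tutorial:

```python
from langchain.chains import RetrievalQA

# A second QA chain over the same FAISS index, but backed by GPT-4
azure_qa_chain = RetrievalQA.from_chain_type(
    llm=azure_llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
)

def compare_responses(question: str) -> dict:
    """Ask both chains the same question and return their answers."""
    return {
        "tinyllama": qa_chain.invoke({"query": question})["result"],
        "gpt4": azure_qa_chain.invoke({"query": question})["result"],
    }
```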
Step 3 – Response Formatting
Here we will format the responses from both systems for display using HTML styles.
Step 4 – UI Creation
Then we will create the side-by-side comparison UI using ipywidgets.
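A minimal sketch of the side-by-side layout, assuming the compare_responses helper above:

```python
import ipywidgets as widgets
from IPython.display import HTML, display

question_box = widgets.Text(placeholder="Ask a question about ARO...")
compare_button = widgets.Button(description="Compare")
left_pane, right_pane = widgets.Output(), widgets.Output()

def on_compare_clicked(_button):
    # Query both chains, then render each answer in its own pane
    answers = compare_responses(question_box.value)
    with left_pane:
        left_pane.clear_output()
        display(HTML(f"<b>TinyLlama:</b> {answers['tinyllama']}"))
    with right_pane:
        right_pane.clear_output()
        display(HTML(f"<b>GPT-4:</b> {answers['gpt4']}"))

compare_button.on_click(on_compare_clicked)
display(widgets.VBox([question_box, compare_button, widgets.HBox([left_pane, right_pane])]))
```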
Step 5 – System Initialization
And lastly, we will initialize and launch the complete comparison system.
Now, copy this code into the next cell of your Jupyter notebook, and replace the deployment endpoint URI and key in the azure_openai_config with those from your Azure OpenAI deployment:
Finally, run the cell. Once it completes, you can start entering your questions, and it will provide the responses from both systems.
This is what I got when I entered the same question as before, i.e. “what is aro?” (please note that the outcomes may vary):
