
Retrieval-augmented generation (RAG) helps large language models (LLMs) generate more accurate and relevant responses by providing context from external data sources. By using real-time retrieval to supply up-to-date, domain-specific knowledge, applications such as chatbots, summarization tools and other interactive AI systems can return more targeted results, reducing the need for fine-tuning and limiting hallucinations.
This post follows my journey implementing a RAG application, exploring what worked, what didn’t and my learnings along the way.
The Goal
To implement a simple RAG, I used Aerospike Vector Search (AVS) to perform semantic search across the Aerospike documentation, and an LLM to generate responses to user questions.
The workflow looks something like this:
- User submits a query.
- An embedding model (Nomic) generates a vector from the query.
- AVS uses the vector to perform a semantic search, returning relevant documentation for context.
- The search results are combined with the query into an engineered prompt.
- The prompt is passed to the LLM (Gemma) which streams a response to the user.

Figure 1: RAG production workflow
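In code terms, that workflow boils down to a handful of function calls. Here's a conceptual outline in Python; every name is a placeholder for a component covered later in the post, not the actual implementation:

```python
# Conceptual outline of the RAG workflow; each function is a placeholder
# for a component described in the sections below.
def answer_question(query: str):
    query_vector = embed_query(query)                 # 1. embed the query (Nomic)
    chunks = semantic_search(query_vector, limit=5)   # 2. retrieve relevant doc chunks from AVS
    prompt = build_prompt(context="\n\n".join(chunks), question=query)  # 3. engineered prompt
    yield from stream_llm(prompt)                     # 4. stream the LLM's (Gemma) response
```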
Complex RAGs take this process a step further, introducing additional layers of processing by incorporating concepts like reranking and traversing graph databases.
The Data
Before we can do anything, we must identify the data we’re going to retrieve – in other words, the “retrieval” part of our RAG. As mentioned, I decided to use the Aerospike documentation. I employed a web scraper to crawl all the content and feed it into a pipeline that breaks up, or “chunks,” each document and stores the content in AVS.
Chunking is no easy task. Each document must be broken into multiple smaller parts. We want each chunk to maintain context within the document, being careful not to split individual thoughts across chunks. For example, a block of sample code should probably not be broken up, and it most likely needs some of the text before or after it for the code to make sense. The same goes for tables and other blocks of information that easily lose context. Maintaining some overlap between chunks can be helpful as well.
I employed an incredibly complex chunking algorithm where I took the scraped HTML, broke it up by the top-level elements, converted them to Markdown and concatenated chunks until I reached about 2,000 or more words each … OK, it’s not complex at all, and I’m sure you can quickly pick up on some pitfalls of this method, like:
- What happens if my current chunk is 1,999 words and the next chunk to concatenate is 5,000 words?
- How am I maintaining context through the document just by counting words?
- You said overlapping chunks was a good idea. Why didn’t you do that?
I stuck with this method for two reasons:
- I was on a deadline and didn’t have a lot of time to mess around.
- I actually saw pretty great results.
This is very much a trial-and-error task, and what worked for my content won’t necessarily work for yours. Explore a variety of chunking methods to see what works best for your use case.
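For what it's worth, here's a minimal sketch of that word-count-based approach in Python. It assumes BeautifulSoup and markdownify for the HTML handling; the names and thresholds are illustrative, not the actual pipeline code.

```python
# Minimal sketch of word-count-based chunking: split by top-level HTML
# elements, convert to Markdown, concatenate until ~2,000 words.
from bs4 import BeautifulSoup
from markdownify import markdownify


def chunk_page(html: str, target_words: int = 2000) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    body = soup.body or soup

    chunks, current, word_count = [], [], 0
    for element in body.find_all(recursive=False):  # top-level elements only
        markdown = markdownify(str(element)).strip()
        if not markdown:
            continue
        current.append(markdown)
        word_count += len(markdown.split())
        if word_count >= target_words:
            chunks.append("\n\n".join(current))
            current, word_count = [], 0

    if current:  # keep the trailing partial chunk
        chunks.append("\n\n".join(current))
    return chunks
```

As the pitfalls above suggest, this doesn't split oversized elements or add overlap between chunks; both would be worth adding for a more robust pipeline.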
Generating and Storing Vectors
Once a chunk has reached its target word count, we need to store the content, along with a vector embedding, in the database. To generate the vector embedding, we need an embedding model. The vector, a list of floating-point numbers, captures the semantic meaning of each chunk, allowing us to search our content for similar text based on meaning.
I ended up using Nomic AI’s nomic-embed-text-v1.5. This model works really well: it uses prefixes to distinguish query and document embeddings, has an 8k-token context window (the maximum number of tokens it can embed at once) and has a rather small footprint at ~500MB. The same model used to generate vector embeddings during data loading must also be used for query processing in the application.
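One common way to run this model locally is through the sentence-transformers library (the post doesn't say how it was loaded, so take this as one option). With Nomic's task prefixes, generating embeddings looks roughly like this:

```python
# Rough sketch of generating Nomic embeddings via sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

chunk_text = "Aerospike Vector Search (AVS) adds semantic search to Aerospike..."  # example document chunk
user_query = "How does Aerospike Vector Search index vectors?"

# Nomic distinguishes documents from queries with task prefixes.
doc_embedding = model.encode("search_document: " + chunk_text)
query_embedding = model.encode("search_query: " + user_query)
```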
When a chunk is ready to be stored, a record is created in Aerospike Vector Search. Each record contains the document chunk, the vector embedding, the chunk index within the document, and the page URL of the original documentation. AVS maintains an index on the vector embeddings stored within each record and allows semantic search through a powerful Hierarchical Navigable Small World (HNSW) algorithm. Check out the docs if you’re interested in learning more.
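Continuing the embedding sketch above, writing a chunk record to AVS might look something like the following. I'm recalling the AVS Python client from memory here, so treat the class, method and parameter names as assumptions and check the client docs for the exact signatures.

```python
# Hedged sketch of storing a chunk record in Aerospike Vector Search.
# The client calls below are approximations of the AVS Python client API,
# not verified signatures; doc_embedding and chunk_text come from the
# embedding sketch above.
from aerospike_vector_search import Client, types

avs = Client(seeds=types.HostPort(host="localhost", port=5000))

avs.upsert(
    namespace="test",
    key="docs-page-chunk-0",
    record_data={
        "content": chunk_text,                 # the document chunk itself
        "embedding": doc_embedding.tolist(),   # vector from the embedding model
        "chunk_index": 0,                      # position of the chunk in its document
        "url": "https://aerospike.com/docs/",  # page the chunk came from
    },
)
```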
The Application
Now that we have a database full of content to retrieve, we can start using it to augment the LLM’s response generation. See what I did there? Nice.
Where do we start?
The Frontend
We need some mechanism for users to input a query. Since this app is working with Aerospike’s documentation, we’ll want users to ask questions about Aerospike. A simple frontend should suffice. I decided to use React because that’s what I know, but all the frontend really needs is an input for user queries, a mechanism to talk to our backend and a place to stream the response.
All right, what’s next?
The Backend
In the world of AI applications, Python is a well-supported language you’ll see used quite often. For that reason, and because the AVS client is available in Python, that’s what I chose to build the backend.
I chose FastAPI to build out the server API, and Uvicorn for deployment. The server only needs one endpoint for the frontend to access, sending the user query. The basic flow of that function looks like this (you can check out the code if you’d like):
- The endpoint is triggered with a user query.
- The embedding model generates a vector.
- The vector is used as input to a cosine similarity search in AVS.
- The returned records are used to generate a context string from document chunks.
- The original user query is combined with the context string in an engineered prompt.
- The prompt is sent to the LLM for processing.
- The LLM streams its response, which is returned to the frontend.
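Here's a heavily condensed sketch of what that endpoint can look like with FastAPI and the OpenAI client. The helper names are stand-ins, not the project's actual code, and the model passed to OpenAI is just an example.

```python
# Condensed sketch of the single query endpoint: embed, search, prompt, stream.
# vector_search() is a stand-in for the AVS lookup sketched earlier.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
from sentence_transformers import SentenceTransformer

app = FastAPI()
llm = OpenAI()  # reads OPENAI_API_KEY from the environment
embedder = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

PROMPT_TEMPLATE = (
    "You are a helpful assistant answering questions about the Aerospike NoSQL database. "
    "Using the following context, answer the question. If you are unable to answer the "
    "question, ask for more information.\n\nContext: {context}\n\nQuestion: {question}"
)


@app.get("/ask")
def ask(question: str):
    # 1. Embed the query (note the Nomic query prefix).
    query_vector = embedder.encode("search_query: " + question)

    # 2. Cosine similarity search in AVS; stand-in for the client call.
    results = vector_search(query_vector, limit=5)

    # 3. Build the context string and the engineered prompt.
    context = "\n\n".join(r["content"] for r in results)
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)

    # 4. Stream the LLM's response straight back to the frontend.
    def stream():
        response = llm.chat.completions.create(
            model="gpt-4o-mini",  # example model choice
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        for chunk in response:
            delta = chunk.choices[0].delta.content
            if delta:
                yield delta

    return StreamingResponse(stream(), media_type="text/plain")
```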
Easy-peasy.
It can’t be that simple, can it? Well, yes and no.
Some Takeaways
Chunking
Going back to the chunking method, I had to tweak the way the content was chunked, not only for quality search results, but also to fit the retrieved context within the context window of the LLM. I started down this road using Llama2-70B, which has a 4k context window, meaning both the prompt and the response need to fit into 4k tokens (one token is on average ~¾ of a word).
My initial chunking methods overflowed this restriction far too often. I moved to Google’s Gemma-7B-Instruct model, opening the context window to 8k, but still saw issues on occasion. The code linked within this post is set up to use OpenAI’s API, so context windows probably won’t be an issue going forward.
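A simple guardrail here is to cap the context before building the prompt. Using the rough one-token-per-¾-word estimate from above, a sketch might look like this (the ratio and budget are ballpark numbers, not exact tokenization):

```python
# Rough token budgeting using the ~3/4-word-per-token rule of thumb.
def estimate_tokens(text: str) -> int:
    return int(len(text.split()) * 4 / 3)


def trim_context(chunks: list[str], token_budget: int = 3000) -> str:
    kept, used = [], 0
    # Keep whole chunks, most relevant first, until the budget runs out.
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > token_budget:
            break
        kept.append(chunk)
        used += cost
    return "\n\n".join(kept)
```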
Prompt Engineering
Prompt engineering was way harder than I thought it would be. This is what I ended up with:
'''\
You are a helpful assistant answering questions about the Aerospike NoSQL database.
Using the following context, answer the question.
If you are unable to answer the question, ask for more information.

Context: {context}

Question: {question}
'''
It’s not that complex, but I went through many iterations to get to a point where I was able to reliably return quality responses from the LLM. Changing models meant a new look at the prompt, as some models handle prompting differently than others. Prompt engineering is absolutely a science, but it’s also a bit of an art.
There are many knobs and dials to turn when working with an application like this, and though my methods worked well for me, this is by no means the “correct” or “only” way to do this.
Conclusion
RAG helps to facilitate the delivery of accurate and contextually relevant responses, making it a scalable solution for dynamic data environments. As RAG continues to evolve, it can provide even more refined, accurate and real-time responses, driving innovation and enhancing user interactions across various domains.
You can find the working code on GitHub and play around with your own RAG application.
Spin up your own Aerospike Vector Search sandbox. Let us know your thoughts and questions by joining us on Discord.