The Nvidia NIM platform allows developers to perform inference on generative AI models. In this article, we will explore how to consume the NIM APIs to build a simple RAG application. For the vector database, we will use Zilliz Cloud, the hosted, commercial version of the popular open source Milvus vector database.
We will use meta/llama3-8b-instruct as the LLM, nvidia/nv-embedqa-e5-v5 as the text embeddings model, and Zilliz to perform semantic search.
While this tutorial focuses on cloud-based APIs, the next part of this series will run the same LLM, embeddings model and vector database as containers.
The advantage of using NIM is that the cloud APIs are fully compatible with the self-hosted containers running locally on a GPU machine, and those containers take advantage of GPU acceleration when run locally.
Let’s get started building the application.
Step 1: Create an API Key for NIM
Visit the NIM catalog and sign up with your email address to create an API key.
Search for meta/llama3-8b-instruct and click on “Build with this NIM” to create an API key.
Copy the API key and save it in a safe location.
Step 2: Create an Instance of Free Zilliz Cluster
Sign up for Zilliz Cloud and create a cluster. New accounts come with $100 in credits, which is sufficient for experimenting with this tutorial.
Make sure you copy the endpoint URI and the API key of your cluster.
Step 3: Create an Environment Configuration File
Create a .env file with the URIs and API keys. This comes in handy when we access the APIs; when we switch to local endpoints, we only need to update this file. Ensure that the values match the ones you saved in the previous two steps.
LLM_URI="https://integrate.api.nvidia.com/v1" EMBED_URI="https://integrate.api.nvidia.com/v1" VECTORDB_URI="YOUR_ZILLIZ_CLUSTER_URI" NIM_API_KEY="YOUR_NIM_API_KEY" ZILLIZ_API_KEY="YOUR_ZILLIZ_API_KEY"
Step 4: Create the RAG Application
Launch a Jupyter Notebook and install the required Python modules.
!pip install pymilvus
!pip install openai
!pip install python-dotenv
Let’s start by importing the modules.
from pymilvus import MilvusClient
from pymilvus import connections
from openai import OpenAI
from dotenv import load_dotenv
import os
import ast
Load the environment variables and initialize the clients for LLM, embeddings and the vector database.
load_dotenv()

LLM_URI = os.getenv("LLM_URI")
EMBED_URI = os.getenv("EMBED_URI")
VECTORDB_URI = os.getenv("VECTORDB_URI")
NIM_API_KEY = os.getenv("NIM_API_KEY")
ZILLIZ_API_KEY = os.getenv("ZILLIZ_API_KEY")

llm_client = OpenAI(
    api_key=NIM_API_KEY,
    base_url=LLM_URI
)

embedding_client = OpenAI(
    api_key=NIM_API_KEY,
    base_url=EMBED_URI
)

vectordb_client = MilvusClient(
    uri=VECTORDB_URI,
    token=ZILLIZ_API_KEY
)
The next step is to create the collection in the Zilliz cluster.
if vectordb_client.has_collection(collection_name="india_facts"):
    vectordb_client.drop_collection(collection_name="india_facts")

vectordb_client.create_collection(
    collection_name="india_facts",
    dimension=1024,
)
We set the dimension to 1,024 based on the vector size returned by the embeddings model.
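If you want to verify this yourself, a quick, optional check is to embed a short test string through the same endpoint we initialized above and inspect the length of the returned vector:

# Optional sanity check: the returned vector should have 1,024 dimensions
test_response = embedding_client.embeddings.create(
    input=["dimension check"],
    model="nvidia/nv-embedqa-e5-v5",
    encoding_format="float",
    extra_body={"input_type": "query", "truncate": "NONE"}
)
print(len(test_response.data[0].embedding))  # expected: 1024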
Let’s create a list of strings, convert them into embedding vectors and ingest them into the database.
docs = [
    "India is the seventh-largest country by land area in the world.",
    "The Indus Valley Civilization, one of the world's oldest, originated in India around 3300 BCE.",
    "The game of chess, originally called 'Chaturanga,' was invented in India during the Gupta Empire.",
    "India is home to the world's largest democracy, with over 900 million eligible voters.",
    "The Indian mathematician Aryabhata was the first to explain the concept of zero in the 5th century.",
    "India has the second-largest population in the world, with over 1.4 billion people.",
    "The Kumbh Mela, held every 12 years, is the largest religious gathering in the world, attracting millions of devotees.",
    "India is the birthplace of four major world religions: Hinduism, Buddhism, Jainism, and Sikhism.",
    "The Indian Space Research Organisation (ISRO) successfully sent a spacecraft to Mars on its first attempt in 2014.",
    "India's Varanasi is considered one of the world's oldest continuously inhabited cities, with a history dating back over 3,000 years."
]

def embed(docs):
    response = embedding_client.embeddings.create(
        input=docs,
        model="nvidia/nv-embedqa-e5-v5",
        encoding_format="float",
        extra_body={"input_type": "query", "truncate": "NONE"}
    )
    vectors = [embedding_data.embedding for embedding_data in response.data]
    return vectors

vectors = embed(docs)

data = [
    {"id": i, "vector": vectors[i], "text": docs[i], "subject": "history"}
    for i in range(len(vectors))
]

vectordb_client.insert(collection_name="india_facts", data=data)
We will then create a helper function to retrieve the context from the vector database.
def retrieve(query):
    # Embed the query with the same model used for ingestion
    query_vectors = embed([query])

    search_results = vectordb_client.search(
        collection_name="india_facts",
        data=query_vectors,
        limit=3,
        output_fields=["text", "subject"]
    )

    all_texts = []

    for item in search_results:
        try:
            evaluated_item = ast.literal_eval(item) if isinstance(item, str) else item
        except:
            evaluated_item = item

        if isinstance(evaluated_item, list):
            all_texts.extend(
                subitem['entity']['text']
                for subitem in evaluated_item
                if isinstance(subitem, dict) and 'entity' in subitem and 'text' in subitem['entity']
            )
        elif isinstance(evaluated_item, dict) and 'entity' in evaluated_item and 'text' in evaluated_item['entity']:
            all_texts.append(evaluated_item['entity']['text'])

    return " ".join(all_texts)
This retrieves the top three documents, concatenates the text from each one and returns a single string.
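As a quick, illustrative test (the question below is made up for this example), a query about India’s space program should surface the ISRO fact among the top three results:

# Illustrative retrieval test; the result is a single string built from the three closest facts
context = retrieve("When did ISRO send a spacecraft to Mars?")
print(context)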
With the retriever step in place, it’s time to create another helper function to generate the answer from the LLM.
def generate(context, question):
    prompt = f'''
    Based on the context: {context}
    Please answer the question: {question}
    '''

    system_prompt = '''
    You are a helpful assistant that answers questions based on the given context.
    Don't add anything to the response.
    If you cannot find the answer within the context, say I do not know.
    '''

    completion = llm_client.chat.completions.create(
        model="meta/llama3-8b-instruct",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=0,
        top_p=1,
        max_tokens=1024
    )

    return completion.choices[0].message.content
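Before wiring it into the full pipeline, you can exercise generate on its own with a hand-picked context; the example below simply reuses one of the facts we ingested earlier.

# Illustrative standalone test of the generation step
answer = generate(
    context="The Indian Space Research Organisation (ISRO) successfully sent a spacecraft to Mars on its first attempt in 2014.",
    question="When did ISRO send a spacecraft to Mars?"
)
print(answer)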
We will finally wrap these two functions inside another function called chat, which first retrieves the context and then sends it to the LLM along with the original prompt sent by the user.
def chat(prompt):
    context = retrieve(prompt)
    response = generate(context, prompt)
    return response
When we invoke the function, we will see the response from the LLM derived from the context.
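For example, the question below (any question answerable from the ingested facts works) exercises the full pipeline:

# Example invocation against the ingested facts
response = chat("Which city in India is one of the oldest continuously inhabited cities in the world?")
print(response)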
As you can see, the response is based on the context that the vector database has retrieved.
In the next part of this series, we will run all the components of this RAG application locally on a GPU-accelerated machine. Stay tuned.