The Nvidia NIM platform allows developers to perform inference on generative AI models. In this article, we will explore how to consume the NIM APIs to build a simple RAG application. For the vector database, we will use Zilliz Cloud, the hosted, commercial version of the popular open source Milvus vector database.
We will use meta/llama3-8b-instruct as the LLM, nvidia/nv-embedqa-e5-v5 as the text embeddings model, and Zilliz to perform semantic search.
While this tutorial focuses on cloud-based APIs, the next part of this series will run the same LLM, embeddings model and vector database as containers.
The advantage of using NIM is that the cloud APIs are fully compatible with the self-hosted containers running locally on a GPU machine, and those containers take advantage of GPU acceleration when run locally.
Let’s get started building the application.
Step 1: Create an API Key for NIM
Visit the NIM catalog and sign up with your email address to create an API key.
Search for meta/llama3-8b-instruct and click on “Build with this NIM” to create an API key.
Copy the API key and save it in a safe location.
Step 2: Create an Instance of Free Zilliz Cluster
Sign up for Zilliz Cloud and create a cluster. New accounts come with $100 in credits, which is sufficient for experimenting with this tutorial.
Make sure you copy the endpoint URI and the API key of your cluster.
Step 3: Create an Environment Configuration File
Create a .env file with the URIs and API keys. This comes in handy when we access the APIs; when we switch to local endpoints, we only need to update this file. Ensure that the values match the ones you saved in the previous two steps.
LLM_URI="https://integrate.api.nvidia.com/v1" EMBED_URI="https://integrate.api.nvidia.com/v1" VECTORDB_URI="YOUR_ZILLIZ_CLUSTER_URI" NIM_API_KEY="YOUR_NIM_API_KEY" ZILLIZ_API_KEY="YOUR_ZILLIZ_API_KEY"
Step 4: Create the RAG Application
Launch a Jupyter Notebook and install the required Python modules.
!pip install pymilvus
!pip install openai
!pip install python-dotenv
Let’s start by importing the modules.
from pymilvus import MilvusClient
from pymilvus import connections
from openai import OpenAI
from dotenv import load_dotenv
import os
import ast
Load the environment variables and initialize the clients for LLM, embeddings and the vector database.
load_dotenv()

LLM_URI = os.getenv("LLM_URI")
EMBED_URI = os.getenv("EMBED_URI")
VECTORDB_URI = os.getenv("VECTORDB_URI")
NIM_API_KEY = os.getenv("NIM_API_KEY")
ZILLIZ_API_KEY = os.getenv("ZILLIZ_API_KEY")

llm_client = OpenAI(
    api_key=NIM_API_KEY,
    base_url=LLM_URI
)

embedding_client = OpenAI(
    api_key=NIM_API_KEY,
    base_url=EMBED_URI
)

vectordb_client = MilvusClient(
    uri=VECTORDB_URI,
    token=ZILLIZ_API_KEY
)
The next step is to create the collection in the Zilliz cluster.
if vectordb_client.has_collection(collection_name="india_facts"):
    vectordb_client.drop_collection(collection_name="india_facts")

vectordb_client.create_collection(
    collection_name="india_facts",
    dimension=1024,
)
We set the dimension to 1,024 based on the vector size returned by the embeddings model.
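If you want to verify this yourself, a quick, optional check is to embed a short test string through the same endpoint we initialized above and inspect the length of the returned vector:

# Optional sanity check: the returned vector should have 1,024 dimensions
test_response = embedding_client.embeddings.create(
    input=["dimension check"],
    model="nvidia/nv-embedqa-e5-v5",
    encoding_format="float",
    extra_body={"input_type": "query", "truncate": "NONE"}
)
print(len(test_response.data[0].embedding))  # expected: 1024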
Let’s create a list of strings, convert them into embedding vectors and ingest them into the database.
docs = [
    "India is the seventh-largest country by land area in the world.",
    "The Indus Valley Civilization, one of the world's oldest, originated in India around 3300 BCE.",
    "The game of chess, originally called 'Chaturanga,' was invented in India during the Gupta Empire.",
    "India is home to the world's largest democracy, with over 900 million eligible voters.",
    "The Indian mathematician Aryabhata was the first to explain the concept of zero in the 5th century.",
    "India has the second-largest population in the world, with over 1.4 billion people.",
    "The Kumbh Mela, held every 12 years, is the largest religious gathering in the world, attracting millions of devotees.",
    "India is the birthplace of four major world religions: Hinduism, Buddhism, Jainism, and Sikhism.",
    "The Indian Space Research Organisation (ISRO) successfully sent a spacecraft to Mars on its first attempt in 2014.",
    "India's Varanasi is considered one of the world's oldest continuously inhabited cities, with a history dating back over 3,000 years."
]

def embed(docs):
    response = embedding_client.embeddings.create(
        input=docs,
        model="nvidia/nv-embedqa-e5-v5",
        encoding_format="float",
        extra_body={"input_type": "query", "truncate": "NONE"}
    )
    vectors = [embedding_data.embedding for embedding_data in response.data]
    return vectors

vectors = embed(docs)

data = [
    {"id": i, "vector": vectors[i], "text": docs[i], "subject": "history"}
    for i in range(len(vectors))
]

vectordb_client.insert(collection_name="india_facts", data=data)
We will then create a helper function to retrieve the context from the vector database.
def retrieve(query):
    # Embed the query with the same model used for ingestion
    query_vectors = embed([query])

    search_results = vectordb_client.search(
        collection_name="india_facts",
        data=query_vectors,
        limit=3,
        output_fields=["text", "subject"]
    )

    all_texts = []

    for item in search_results:
        try:
            evaluated_item = ast.literal_eval(item) if isinstance(item, str) else item
        except:
            evaluated_item = item

        if isinstance(evaluated_item, list):
            all_texts.extend(
                subitem['entity']['text']
                for subitem in evaluated_item
                if isinstance(subitem, dict) and 'entity' in subitem and 'text' in subitem['entity']
            )
        elif isinstance(evaluated_item, dict) and 'entity' in evaluated_item and 'text' in evaluated_item['entity']:
            all_texts.append(evaluated_item['entity']['text'])

    return " ".join(all_texts)
This retrieves the top three documents, concatenates the text from each one and returns a single string.
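As a quick, illustrative test (the question below is made up for this example), a query about India’s space program should surface the ISRO fact among the top three results:

# Illustrative retrieval test; the result is a single string built from the three closest facts
context = retrieve("When did ISRO send a spacecraft to Mars?")
print(context)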
With the retriever step in place, it’s time to create another helper function to generate the answer from the LLM.
def generate(context, question):
    prompt = f'''
    Based on the context: {context}
    Please answer the question: {question}
    '''

    system_prompt = '''
    You are a helpful assistant that answers questions based on the given context.
    Don't add anything to the response.
    If you cannot find the answer within the context, say I do not know.
    '''

    completion = llm_client.chat.completions.create(
        model="meta/llama3-8b-instruct",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=0,
        top_p=1,
        max_tokens=1024
    )

    return completion.choices[0].message.content
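Before wiring it into the full pipeline, you can exercise generate on its own with a hand-picked context; the example below simply reuses one of the facts we ingested earlier.

# Illustrative standalone test of the generation step
answer = generate(
    context="The Indian Space Research Organisation (ISRO) successfully sent a spacecraft to Mars on its first attempt in 2014.",
    question="When did ISRO send a spacecraft to Mars?"
)
print(answer)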
We will finally wrap these two functions inside another function called chat, which first retrieves the context and then sends it to the LLM along with the original prompt sent by the user.
def chat(prompt):
    context = retrieve(prompt)
    response = generate(context, prompt)
    return response
When we invoke the function, we will see the response from the LLM derived from the context.
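For example, the question below (any question answerable from the ingested facts works) exercises the full pipeline:

# Example invocation against the ingested facts
response = chat("Which city in India is one of the oldest continuously inhabited cities in the world?")
print(response)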
As you can see, the response is based on the context that the vector database has retrieved.
In the next part of this series, we will run all the components of this RAG application locally on a GPU-accelerated machine. Stay tuned.