
Building an Open LLM App Using Hermes 2 Pro Deployed Locally


In my previous tutorial, I showed you how to bring real-time data to LLMs through function calling, using OpenAI’s latest LLM, GPT-4o. In this follow-up, I will look at function calling with Hermes 2 Pro – Llama-3 8B, a powerful LLM developed by Nous Research and based on Meta’s Llama 3 architecture, with 8 billion parameters. It’s an open model, and we will run it on Hugging Face’s Text Generation Inference (TGI).

As with the previous post, we will integrate FlightAware’s API with the LLM to track flight status in real time.

FlightAware’s AeroAPI is a perfect tool for developers to gain access to comprehensive flight information. It enables real-time flight tracking, historical and future flight data, and flight searches by various criteria. The API presents data in a user-friendly JSON format, making it highly usable and integrable. We will invoke the REST API to get the real-time status of a flight based on the prompt sent to an LLM by the user.
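Before wiring it into the LLM, you can sanity-check your AeroAPI access directly from the command line. The sketch below assumes the AeroAPI v4 base URL and the x-apikey authentication header; verify both against the AeroAPI documentation for your account.

export AEROAPI_KEY="YOUR_AEROAPI_KEY"

curl -s "https://aeroapi.flightaware.com/aeroapi/flights/EK524" \
  -H "x-apikey: $AEROAPI_KEY"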

What Is Hermes 2 Pro?

Hermes 2 Pro – Llama-3 8B excels at natural language processing tasks, creative writing, coding assistance, and more. One of its standout features is its exceptional function-calling capability, which allows it to execute external functions and retrieve information related to stock prices, company fundamentals, financial statements, and more.

This model leverages a special system prompt and multi-turn function calling structure with a new ChatML role, making function calling reliable and easy to parse. According to benchmarks, Hermes 2 Pro – Llama-3 scored an impressive 90% on the function calling evaluation built in partnership with Fireworks AI.

Deploying Hermes 2 Pro Locally

For this setup, I am using a Linux server powered by an NVIDIA GeForce RTX 4090 GPU, which comes with 24GB of VRAM. It’s running Docker and the NVIDIA Container Toolkit to enable containers to access the GPU.

We will use the Text Generation Inference server from Hugging Face to run Hermes 2 Pro.

The below command launches the inference engine on port 8080 and serves the LLM through a REST endpoint.

export token="YOUR_HF_TOKEN"

export model="NousResearch/Hermes-2-Pro-Llama-3-8B"

export volume="/home/ubuntu/data"

docker run --name hermes -d --gpus all -e HUGGING_FACE_HUB_TOKEN=$token --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0.3 --model-id $model --max-total-tokens 8096
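On the first launch, TGI pulls the model weights from Hugging Face, which can take a few minutes. To follow the download and know when the server is ready to accept requests, tail the container logs:

docker logs -f hermes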


To test the endpoint, run the following command:


curl 127.0.0.1:8080 \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"What is Deep Learning?"}'


If everything is right, you should see the response from Hermes 2 Pro.
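If you prefer to test from Python, a minimal equivalent using the requests library looks like this; it mirrors the curl call above against the same endpoint.

import requests

# Query the TGI endpoint launched earlier on port 8080
response = requests.post(
    "http://127.0.0.1:8080",
    json={"inputs": "What is Deep Learning?"},
)
print(response.json())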

Function to Track Flight Status

Before proceeding, sign up with FlightAware and obtain your API key, which is required for using the REST API. The free personal tier is sufficient to complete this tutorial.

Once you have the API key, create the below function in Python to retrieve the status of any flight.

Though the code is straightforward, let me explain the key steps.

This function, get_flight_status, takes a flight parameter (assumed to be a flight identifier) and returns formatted flight details in JSON format. It queries the AeroAPI to fetch flight data based on the given flight identifier and formats key details such as the source, destination, departure time, arrival time, and status.

Let’s look at the components of the script:

API Credentials:
AEROAPI_BASE_URL is the base URL for the FlightAware AeroAPI.
AEROAPI_KEY is the API key used for authentication.

Session Management:
get_api_session: This nested function initializes a requests session, sets the required authentication header with the API key, and returns the session object. This session handles all API requests.

Data Fetching:
fetch_flight_data: This function takes flight_id and session as arguments. It constructs the endpoint URL with appropriate date filters for fetching data for one day, and sends a GET request to retrieve the flight data. The function handles the API response and extracts the relevant flight information.

Time Conversion:
utc_to_local: Converts UTC time (from the API response) to local time based on the provided timezone string. This function helps us get the arrival and departure times based on the city.

Data Processing:
The script determines keys for departure and arrival times based on the availability of estimated or actual times, with a fallback to scheduled times. It then constructs a dictionary containing formatted flight details.
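Here is a minimal sketch of get_flight_status that follows the structure described above. The AeroAPI response field names (origin, destination, estimated_out, scheduled_in, and so on), the x-apikey header, and the timezone handling are assumptions based on this description and AeroAPI’s documentation; adjust them as needed. It relies on the requests and pytz packages.

import json
from datetime import datetime, timedelta, timezone

import pytz
import requests

# API Credentials
AEROAPI_BASE_URL = "https://aeroapi.flightaware.com/aeroapi"
AEROAPI_KEY = "YOUR_AEROAPI_KEY"


def get_flight_status(flight):
    """Returns formatted flight details for the given flight identifier as a JSON string."""

    def get_api_session():
        # Initialize a session and set the authentication header once
        session = requests.Session()
        session.headers.update({"x-apikey": AEROAPI_KEY})
        return session

    def fetch_flight_data(flight_id, session):
        # Restrict the query to a one-day window around the current date
        start_date = datetime.now(timezone.utc).date()
        end_date = start_date + timedelta(days=1)
        api_resource = f"/flights/{flight_id}?start={start_date}&end={end_date}"
        response = session.get(AEROAPI_BASE_URL + api_resource)
        response.raise_for_status()
        return response.json()["flights"][0]

    def utc_to_local(utc_date_str, local_timezone_str):
        # Convert an ISO 8601 UTC timestamp from the API into local time
        utc_datetime = datetime.strptime(utc_date_str, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=pytz.utc)
        local_datetime = utc_datetime.astimezone(pytz.timezone(local_timezone_str))
        return local_datetime.strftime("%Y-%m-%d %H:%M:%S")

    session = get_api_session()
    flight_data = fetch_flight_data(flight, session)

    # Prefer estimated times, then actual, with a fallback to scheduled times
    dep_key = next(k for k in ("estimated_out", "actual_out", "scheduled_out") if flight_data.get(k))
    arr_key = next(k for k in ("estimated_in", "actual_in", "scheduled_in") if flight_data.get(k))

    flight_details = {
        "source": flight_data["origin"]["city"],
        "destination": flight_data["destination"]["city"],
        "depart_time": utc_to_local(flight_data[dep_key], flight_data["origin"]["timezone"]),
        "arrival_time": utc_to_local(flight_data[arr_key], flight_data["destination"]["timezone"]),
        "status": flight_data["status"],
    }
    return json.dumps(flight_details)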

For example, querying the FlightAware API for Emirates flight EK524, which flies from Dubai to Hyderabad, returns the formatted details with arrival and departure times expressed in the local time of each city.

Our goal is to integrate this function with Hermes 2 Pro to give it real-time access to flight tracking information.

Integrating the Function with Hermes 2 Pro

Start by installing the latest version of Hugging Face Python SDK with the below command:

pip install --upgrade huggingface_hub


Import the module and initialize the client by pointing it to the TGI endpoint.

from huggingface_hub import InferenceClient

client = InferenceClient("http://127.0.0.1:8080")


Next, define the function schema in the same format as OpenAI function calling.

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_flight_status",
            "description": "Get status of a flight",
            "parameters": {
                "type": "object",
                "properties": {
                    "flight": {
                        "type": "string",
                        "description": "Flight number"
                    }
                },
                "required": ["flight"]
            }
        }
    }
]


This populates the list with one or more functions that the LLM can use as tools.

We will now create the chatbot that accepts a prompt and determines whether a function needs to be called. If so, the LLM first returns the name of the function to invoke along with its arguments. The output from the function is then sent to the LLM as part of a second invocation, and the final response contains a factually correct answer based on the function’s output.

def chatbot(prompt):
    messages = [
        {
            "role": "system",
            "content": "You're a helpful assistant! Answer the user's question as best you can based on the tools provided. Be concise in your responses.",
        },
        {
            "role": "user",
            "content": prompt
        },
    ]

    # First pass: let the LLM decide whether any of the tools should be called
    response = client.chat_completion(messages=messages, tools=tools)
    tool_calls = response.choices[0].message.tool_calls

    if tool_calls:
        # Map function names returned by the LLM to the actual Python callables
        available_functions = {
            "get_flight_status": get_flight_status,
        }

        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_to_call = available_functions[function_name]
            function_args = tool_call.function.arguments

            # Invoke the function with the arguments extracted by the LLM
            function_response = function_to_call(flight=function_args.get("flight"))

            # Feed the function's output back to the LLM as a tool message
            messages.append(
                {
                    "role": "tool",
                    "name": function_name,
                    "content": function_response
                }
            )

        # Second pass: the LLM composes the final answer from the tool output
        final_response = client.chat_completion(messages=messages)
        return final_response

    # No tool was needed; return the LLM's direct answer
    return response


One benefit of using the Hugging Face Python libraries is that they automatically format the prompt the way the target LLM expects. For example, when using functions, the prompt for Hermes 2 Pro needs to be structured in a specific format:

<|im_start|>system
You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools: <tools> [{'type': 'function', 'function': {'name': 'get_stock_fundamentals', 'description': 'Get fundamental data for a given stock symbol using yfinance API.', 'parameters': {'type': 'object', 'properties': {'symbol': {'type': 'string'}}, 'required': ['symbol']}}}] </tools> Use the following pydantic model json schema for each tool call you will make: {'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']} For each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags as follows:
<tool_call>
{'arguments': <args-dict>, 'name': <function-name>}
</tool_call>
<|im_end|>


Similarly, the output of the function can be sent to the LLM in the below format:

<|im_start|>tool
<tool_response>
{"name": "get_stock_fundamentals", "content": {'symbol': 'TSLA', 'company_name': 'Tesla, Inc.', 'sector': 'Consumer Cyclical', 'industry': 'Auto Manufacturers', 'market_cap': 611384164352, 'pe_ratio': 49.604652, 'pb_ratio': 9.762013, 'dividend_yield': None, 'eps': 4.3, 'beta': 2.427, '52_week_high': 299.29, '52_week_low': 152.37}}
</tool_response>
<|im_end|>


Ensuring that the prompt follows this template requires careful formatting. The InferenceClient class handles this translation efficiently, enabling the developer to use the familiar OpenAI format of system, user, tool, and assistant roles in the prompt.
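If you want to see the rendered prompt yourself, one way (assuming the transformers library is installed and you can download the model's tokenizer) is to apply the model's chat template directly:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Hermes-2-Pro-Llama-3-8B")

messages = [
    {"role": "user", "content": "What's the status of EK226?"},
]

# Render the ChatML prompt exactly as the model expects it
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)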

During the first call to the chat completion API, the LLM responds with a tool call that names get_flight_status and supplies the flight number extracted from the prompt.
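The exact response objects depend on the huggingface_hub version, but conceptually that first response carries a tool call shaped roughly like this (illustrative, using the flight number from the test prompt below):

{
  "role": "assistant",
  "tool_calls": [
    {
      "type": "function",
      "function": {
        "name": "get_flight_status",
        "arguments": {"flight": "EK226"}
      }
    }
  ]
}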

Subsequently, after invoking the function, we embed the result in a tool message and send it back to the LLM.

As you can see, the workflow of integrating function calling is very similar to that of OpenAI.

It’s time to invoke the chatbot and test it through a prompt.

res = chatbot("What's the status of EK226?")
print(res.choices[0].message.content)


This concludes the tutorial on using the function-calling technique with the open model Hermes 2 Pro. The complete chatbot combines the snippets shown above: the get_flight_status function, the tools schema, and the chatbot function.
