Open large language models are becoming increasingly capable and a viable alternative to commercial LLMs such as GPT-4 and Gemini. Given the cost of AI accelerator hardware, many developers are turning to hosted APIs to consume state-of-the-art open models rather than running them on their own GPUs.
While cloud platforms such as Azure OpenAI, Amazon Bedrock and Google Cloud Vertex AI are the obvious choices, there are purpose-built platforms that are faster and cheaper than the hyperscalers.
Here are five generative AI inference platforms to consume open LLMs like Llama 3, Mistral and Gemma. Some of them also support foundation models targeting vision.
1. Groq
Groq is an AI infrastructure company that claims to build the world’s fastest AI inference technology. Its flagship product is the Language Processing Unit (LPU) Inference Engine, a hardware and software platform designed to deliver exceptional compute speed, quality and energy efficiency for AI applications. Developers are drawn to Groq chiefly for its raw inference speed.
A scaled network of LPUs powers the GroqCloud service, which lets users run popular open source LLMs such as Meta AI’s Llama 3 70B at speeds Groq claims are up to 18x faster than other providers. You can consume the API through Groq’s Python client SDK or the OpenAI client SDK, and Groq integrates easily with LangChain and LlamaIndex for building advanced LLM applications and chatbots.
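Because the endpoint is OpenAI-compatible, switching an existing application to Groq is mostly a matter of changing the base URL. Here is a minimal sketch using the OpenAI Python client; the API key is a placeholder, and the model ID shown is one of Groq’s published Llama 3 IDs at the time of writing, so check the console for current names.

```python
# Minimal sketch: calling GroqCloud through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GROQ_API_KEY",  # placeholder; create a key in the Groq console
    base_url="https://api.groq.com/openai/v1",
)

response = client.chat.completions.create(
    model="llama3-70b-8192",  # Groq model ID current at the time of writing
    messages=[{"role": "user", "content": "Explain what an LPU is in one sentence."}],
)
print(response.choices[0].message.content)
```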
In terms of pricing, Groq offers a range of options. For its cloud service, it charges based on tokens processed, with prices ranging from $0.06 to $0.27 per million tokens, depending on the model used. The free tier is a great way to get started with Groq.
2. Perplexity Labs
Perplexity is fast becoming an alternative to Google and Bing. Though its primary product is an AI-powered search engine, the company also offers an inference service through Perplexity Labs.
In October 2023, Perplexity Labs introduced pplx-api, an API designed to facilitate rapid and efficient access to open source LLMs. Currently in public beta, pplx-api allows users with a Perplexity Pro subscription to access the API, enabling a broad user base to test and provide feedback, which helps Perplexity Labs continuously enhance the tool.
The API supports popular LLMs, including Mistral 7B, Llama 2 13B, Code Llama 34B and Llama 2 70B. It is designed to be cost-effective for both deployment and inference, with significant cost savings reported by Perplexity Labs. Because the interface is compatible with the OpenAI client, developers familiar with OpenAI’s ecosystem can integrate it into existing applications with minimal changes. For a quick overview, refer to my tutorial on the Perplexity API.
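The snippet below is a rough sketch of that compatibility: it points the standard OpenAI Python client at Perplexity’s endpoint. The API key is a placeholder, and the model name is one of the identifiers Perplexity has listed; consult the pplx-api documentation for the current set.

```python
# Minimal sketch: using pplx-api through the OpenAI-compatible interface.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_PPLX_API_KEY",  # placeholder; generated in Perplexity settings
    base_url="https://api.perplexity.ai",
)

response = client.chat.completions.create(
    model="mistral-7b-instruct",  # assumed model name; see the pplx-api docs
    messages=[{"role": "user", "content": "What makes an 'online' LLM different?"}],
)
print(response.choices[0].message.content)
```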
The platform also includes llama-3-sonar-small-32k-online and llama-3-sonar-large-32k-online, Llama 3-based models that draw on the FreshLLMs paper. These online models can return citations, a feature that is currently in closed beta.
Perplexity Labs offers a flexible pricing model for its API. The pay-as-you-go plan charges users based on the number of tokens processed, making it accessible without upfront commitments. The Pro plan, priced at $20 per month or $200 per year, includes a $5 monthly credit toward API usage, unlimited file uploads and dedicated support.
The price ranges from $0.20 to $1.00 per million tokens, depending on the model’s size. In addition to the token charges, online models incur a flat $5 fee per thousand requests.
3. Fireworks AI
Fireworks AI is a generative AI platform that enables developers to leverage state-of-the-art open source models for their applications. It offers a wide range of language models, including FireLLaVA-13B (a vision-language model), FireFunction V1 (for function calling), Mixtral MoE 8x7B and 8x22B (instruction-following models), and the Llama 3 70B model from Meta.
In addition to language models, Fireworks AI supports image-generation models like Stable Diffusion 3 and Stable Diffusion XL. These models can be accessed through Fireworks AI’s serverless API, which the company says provides industry-leading performance and throughput.
Fireworks AI’s pricing is competitive: it charges pay-as-you-go based on the number of tokens processed. For example, the Gemma 7B model costs $0.20 per million tokens, while the Mixtral 8x7B model costs $0.50 per million tokens. Fireworks AI also provides on-demand deployments, where users can rent GPU instances (A100 or H100) on an hourly basis. The API is compatible with OpenAI, making it easy to integrate with LangChain and LlamaIndex.
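As a quick sketch of that OpenAI compatibility, the example below targets the Fireworks serverless endpoint. The API key is a placeholder, and the model path follows the accounts/fireworks/models/... naming Fireworks uses; verify the exact model ID in its catalog.

```python
# Minimal sketch: calling a Fireworks AI serverless model via the OpenAI client.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_FIREWORKS_API_KEY",  # placeholder
    base_url="https://api.fireworks.ai/inference/v1",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3-70b-instruct",  # assumed model path
    messages=[{"role": "user", "content": "Briefly explain mixture-of-experts."}],
)
print(response.choices[0].message.content)
```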
Fireworks AI targets developers, businesses and enterprises with different pricing tiers. The Developer tier offers a 600 requests/min rate limit and up to 100 deployed models, while the Business and Enterprise tiers provide custom rate limits, team collaboration features and dedicated support.
4. Cloudflare
Cloudflare Workers AI is an inference platform that enables developers to run machine learning models on Cloudflare’s global network with just a few lines of code. It provides a serverless, scalable solution for GPU-accelerated AI inference, allowing developers to use pretrained models for various tasks, including text generation, image recognition and speech recognition, without having to manage infrastructure or GPUs.
Workers AI offers a curated set of popular open source models that cover a wide range of AI tasks. Notable models include llama-3-8b-instruct, mistral-8x7b-32k-instruct and gemma-7b-instruct, as well as vision models like vit-base-patch16-224 and segformer-b5-finetuned-ade-512-pt.
Workers AI offers versatile integration points for incorporating AI capabilities into existing applications or creating new ones. Developers can use Cloudflare’s serverless execution environment, Workers and Pages Functions, to run AI models within their applications. For those preferring to integrate with their current stack, a REST API is available, enabling inference requests from any programming language or framework. The API supports tasks like text generation, image classification and speech recognition, and developers can enhance their AI applications using Cloudflare’s Vectorize (a vector database) and AI Gateway (a control plane for managing AI models and services).
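For illustration, here is a minimal sketch of the REST API from Python. The account ID and API token are placeholders, and the model slug follows Cloudflare’s @cf/... catalog naming; adjust both to match your account and the model you want.

```python
# Minimal sketch: one inference request against the Workers AI REST API.
import requests

ACCOUNT_ID = "YOUR_ACCOUNT_ID"  # placeholder
API_TOKEN = "YOUR_API_TOKEN"    # placeholder; needs the Workers AI permission
MODEL = "@cf/meta/llama-3-8b-instruct"  # assumed model slug from the catalog

url = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
payload = {"messages": [{"role": "user", "content": "What is serverless inference?"}]}

resp = requests.post(url, headers=headers, json=payload)
resp.raise_for_status()
print(resp.json()["result"]["response"])
```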
Workers AI uses a pay-as-you-go pricing model based on the number of neurons processed. Because the platform hosts a diverse set of models beyond LLMs, the neuron serves as a common, token-like unit that aggregates usage across model types. All accounts get a free allocation of 10,000 neurons per day; beyond that, Cloudflare charges $0.011 per 1,000 additional neurons. The effective cost varies by model size: for instance, Llama 3 70B costs $0.59 per million input tokens and $0.79 per million output tokens, while Gemma 7B costs $0.07 per million tokens for both input and output.
5. Nvidia NIM
The Nvidia NIM API provides access to a wide range of pretrained large language models and other AI models that are optimized and accelerated by Nvidia’s software stack. Through the Nvidia API Catalog, developers can explore and try out over 40 different models from Nvidia, Meta, Microsoft, Hugging Face and other providers. These include powerful text-generation models like Meta’s Llama 3 70B, Mistral AI’s Mixtral 8x22B and Nvidia’s own Nemotron 3 8B, as well as vision models like Stable Diffusion and Kosmos 2.
The NIM API allows developers to easily integrate these state-of-the-art AI models into their applications using just a few lines of code. The models are hosted on Nvidia’s infrastructure and exposed through a standardized OpenAI-compatible API, enabling seamless integration. Developers can prototype and test their applications for free using the hosted API, with options to deploy the models on premises or in the cloud using the recently launched Nvidia NIM containers when ready for production.
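Since the hosted endpoint is OpenAI-compatible, trying a catalog model takes only a few lines. The sketch below uses a placeholder credential, and the base URL and model ID match what the API Catalog showed at the time of writing; confirm them on the model’s page before use.

```python
# Minimal sketch: calling a hosted NIM model through the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_NVIDIA_API_KEY",  # placeholder; generated from the API Catalog
    base_url="https://integrate.api.nvidia.com/v1",
)

response = client.chat.completions.create(
    model="meta/llama3-70b-instruct",  # catalog model ID at the time of writing
    messages=[{"role": "user", "content": "What does a NIM container package?"}],
)
print(response.choices[0].message.content)
```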
Nvidia provides both free and paid tiers for the NIM API. The free tier includes 1,000 credits to get started, while paid pricing is based on the number of tokens processed and model size, ranging from $0.07 per million tokens for smaller models like Gemma 7B, up to $0.79 per million output tokens for large models like Llama 3 70B.
The above list is a subset of inference platforms offering language models as a service. In an upcoming article, I will cover self-hosted model servers and inference engines that can run on Kubernetes. Stay tuned.