SambaNova Systems, a seven-year-old company that is working to establish a foothold in the rapidly expanding AI chip market, is rolling out a cloud service that executives say accelerates the inferencing work for developers creating AI applications.
The AI chip and model vendor this week unveiled the SambaNova Cloud, a fast API service that runs on its SN40L AI chip and that developers can use with Meta’s Llama 3.1 AI models, including the 405-billion-parameter variant, the largest of the three open models that launched in July.
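From the developer side, consuming the service looks like calling any hosted model API. The sketch below is a minimal illustration using Python’s openai client against an OpenAI-compatible chat endpoint; the base URL and model identifier are assumptions for illustration, so check SambaNova’s documentation for the actual values.

```python
# Minimal sketch of calling a hosted Llama 3.1 endpoint through an
# OpenAI-compatible API. The base_url and model name below are
# illustrative assumptions, not confirmed values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.sambanova.ai/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="Meta-Llama-3.1-405B-Instruct",  # assumed model identifier
    messages=[{"role": "user",
               "content": "In two sentences, why does inference speed matter for AI agents?"}],
)
print(response.choices[0].message.content)
```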
It’s a significant step in a quickly evolving generative AI space that is putting a growing importance on inferencing tasks, according to Anton McGonnell, vice president of product at SambaNova.
“For enterprises, it’s incredibly important,” McGonnell told The New Stack. “Nobody wants to be beholden to a single supplier because the market power that they wield is obviously far too large. The appetite for a viable alternative is hugely important for enterprises. Developers probably, if all the things were equal, don’t really care. They have a job to do and they’re going to consume whatever is available to them. They care because they’re getting something that they can’t get with GPUs. For inference speed, really fast token generation is just becoming hugely important for all of these new use cases [and] agent types of use cases. It’s important because the GPUs can’t give it to them.”
Nvidia is by far the dominant chip supplier for AI systems, with skyrocketing demand for its highly popular Tensor Core H100 GPUs continuing to put a squeeze on supply while hyperscale cloud providers and AI companies alike anticipate the upcoming launch of the vendor’s powerful Blackwell products.
Focusing on Inferencing
However, while most jobs training large language models (LLMs) are run on GPU-powered systems, the chips are not as efficient for inferencing, giving companies like SambaNova an opening to establish themselves in this growing segment of AI computing. Over the past three months, three of the smaller AI chip makers eyeing that segment, SambaNova among them, have rolled out offerings that they say give them an advantage in inferencing over GPUs from Nvidia, AMD and others.
Think of AI training as a student going to school, and inferencing as the work the student does using that training after graduating.
Make Room for Startups
SambaNova rival Groq in July announced that the three Llama 3.1 models — not only the 405B but also the 8B and 70B versions — are available on GroqCloud Dev Console, a community of more than 300,000 developers that builds AI software using Groq systems, and on GroqChat, aimed at the general public.
Rather than a GPU, Groq offers its language processing unit (LPU), a chip designed for inferencing and language, according to the company.
“With every new release from Meta, we see a significant surge of developers joining our platform,” Groq founder and CEO Jonathan Ross said at the time. “In the last five months, we’ve grown from just a handful of developers to over 300,000, attracted by the quality and openness of Llama, as well as its incredible speed on the Groq LPU.”
Last month, another competitor, Cerebras Systems, launched Cerebras Inference, a service that can provide 1,800 tokens per second for Llama 3.1 8B, and 450 tokens per second for 70B. Company executives said Cerebras Inference is 20 times faster than Nvidia GPU-based inference solutions in hyperscale clouds and much less expensive.
Both Groq and Cerebras use SRAM to accelerate inferencing tasks. Karl Freund, founder and principal analyst at Cambrian-AI Research, wrote in a column for Forbes that SRAM is about 100 times faster than the high bandwidth memory (HBM) used in GPUs, though it also costs more than DRAM or HBM. SRAM also offers far less capacity than HBM, so it is suited to inference but not to training, which needs far more memory, Freund wrote.
Tiered Memory
By contrast, SambaNova uses three tiers of memory in its SN40L RDU (reconfigurable dataflow unit): 520MB of on-chip SRAM, 64GB of HBM and 1.5TB of DDR5. That combination allows the chip to run both training and inferencing workloads and, the company argues, gives it a step up on its rivals.
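A rough back-of-the-envelope calculation shows why that 1.5TB tier matters for a model the size of Llama 3.1 405B. The numbers below are illustrative only, assuming 16-bit or 8-bit weights and ignoring activations and KV-cache overhead.

```python
# Rough weight-memory estimate for Llama 3.1 405B (illustrative only;
# ignores KV cache, activations and runtime overhead).
params = 405e9  # parameter count

for bits in (16, 8):
    gigabytes = params * bits / 8 / 1e9  # bytes per weight times parameter count
    print(f"{bits}-bit weights: ~{gigabytes:,.0f} GB")

# ~810 GB at 16 bits, ~405 GB at 8 bits: far beyond 520MB of SRAM or
# 64GB of HBM on one device, but within reach of a 1.5TB DDR5 tier
# (or memory pooled across a 16-socket node).
```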
“Both Groq and Cerebras are taking what I would consider to be an incredibly inefficient approach to running these models,” McGonnell said. “They’re just mapping it to SRAM. The weights of the models are stored on SRAM and the only way to do that is to have lots and lots and lots of chips.”
Instead, SambaNova can run all three Llama 3.1 models on a single 16-socket node, and performance isn’t impacted if more models are loaded onto the server. In addition, SambaNova is the first to offer the 405B model to developers, according to company officials, because its chip is more efficient and faster. The SambaNova Cloud service can run the 405B, the largest open generative AI model, at more than 100 tokens per second, a key measurement for AI workloads. The 70B model can run at up to 580 tokens per second.
“We’re able to get more speed” than Groq or Cerebras, McGonnell said. “We’re not mapping this all on the SRAM like they are. We were able to get more speed, but we have much, much more efficiency than the approaches those companies are taking. This is a very real thing. It’s very valuable.”
Larger models tend to run inference workloads more slowly than smaller ones, which can lead to such problems as delayed responses, system failures and accidents. By accelerating those inferencing tasks, the SambaNova Cloud lets developers reduce some of those issues. Developers can log on to the SambaNova Cloud to access the models.
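Tokens per second is also easy to sanity-check from the client side. The sketch below times a streamed completion against an OpenAI-compatible endpoint, counting streamed content chunks as a rough proxy for generated tokens; the endpoint and model name are again placeholder assumptions.

```python
# Rough client-side throughput check against an OpenAI-compatible endpoint.
# The base_url and model name are placeholder assumptions.
import time

from openai import OpenAI

client = OpenAI(base_url="https://api.sambanova.ai/v1", api_key="YOUR_API_KEY")

start = time.perf_counter()
chunks = 0
stream = client.chat.completions.create(
    model="Meta-Llama-3.1-70B-Instruct",  # assumed model identifier
    messages=[{"role": "user",
               "content": "Write a 200-word overview of dataflow architectures."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1  # each content chunk is roughly one token
elapsed = time.perf_counter() - start
print(f"~{chunks / elapsed:.0f} tokens/second (client-side estimate)")
```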
The Need for Speed
For SambaNova and its rivals, greater speed means being able to address agentic AI. Typical AI models respond to prompts or run predefined tasks. Agentic AI goes further, interacting with external agents, whether humans, other models or physical devices, to reach a common goal. That requires the model to understand context, make decisions, plan actions and generate responses in real time, capabilities that are critical in use cases like autonomous driving and customer service.
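To see why per-token speed compounds in agentic workloads, consider a minimal agent loop like the hypothetical sketch below: every planning step and tool call is another full model generation, so generation latency multiplies across the chain. The function names and tool-calling convention here are illustrative, not any vendor’s API.

```python
# Minimal, hypothetical agent loop: the model plans, optionally calls a
# tool, observes the result and repeats. Every iteration is another full
# LLM generation, so token-generation speed multiplies across the chain.
from typing import Callable


def run_agent(llm: Callable[[str], str],
              tools: dict[str, Callable[[str], str]],
              goal: str,
              max_steps: int = 5) -> str:
    transcript = f"Goal: {goal}\n"
    for _ in range(max_steps):
        reply = llm(transcript)              # one model generation per step
        if reply.startswith("FINAL:"):       # agent decides it is done
            return reply.removeprefix("FINAL:").strip()
        if reply.startswith("TOOL:"):        # e.g. "TOOL: search | Llama 3.1 context length"
            name, _, arg = reply.removeprefix("TOOL:").partition("|")
            result = tools.get(name.strip(), lambda _a: "unknown tool")(arg.strip())
            transcript += f"{reply}\nObservation: {result}\n"
        else:                                # plain reasoning step
            transcript += reply + "\n"
    return "Step limit reached."
```

If an agent runs, say, five such steps at 500 generated tokens each, the gap between 100 and 580 tokens per second is the difference between roughly 25 seconds and about four seconds of total generation time.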
SambaNova, Cerebras and Groq bring a lot of money to the table. SambaNova has raised more than $1.13 billion over four rounds, Cerebras has pulled in $720 million, and Groq’s $640 million round last month lifted its valuation to $2.8 billion.
The vendors are going to need such large amounts of cash if they plan to navigate their way past Nvidia and other large established players and carve out a space in the highly competitive AI chip field.
A Big, Expanding AI Chip Market
The global AI chip market is booming, with data portal Statista reporting it will expand from $53.66 billion in 2023 to $91.96 billion next year. According to Cerebras, inference is the fastest-growing segment of AI compute, accounting for about 40% of the AI hardware market.
Nvidia may be the top dog in the AI chip space, with its H100 the go-to product for AI. However, the rapidly increasing demand for AI chips has tightened supply of the H100 and other processors, and other companies are looking to fill that gap. That includes hyperscale cloud players like Amazon Web Services (AWS), Microsoft and Google, which are developing their own processors, and chip makers Intel, AMD and Arm, which are expanding their AI product portfolios.