Developer-turned-CEO Lin Qiao foresees an emerging era in AI in which language models are fine-tuned on an organization’s own specialized data. This will allow organizations to take advantage of AI’s language capabilities while using their own data sets to shape the models’ output.
Before becoming CEO of Fireworks AI, Qiao led Meta’s PyTorch efforts. Generative AI can solve hundreds of complex logic problems, she noted, but that’s not the problem enterprises and developers typically face.
“The large model is too expensive to operate, and doesn’t give you the low latency for a good product experience,” she said. “That puts pressure [on] people to go to smaller models.”
Smaller models are also a better fit for the business problems developers are trying to solve.
“They have maybe five business-specific tasks to solve,” she said. “We [are] laser focusing on those smaller open source models, how to bring [them] on par with OpenAI’s model in terms of quality or even beat them in terms of quality. At the same time, we provide much lower latency and much lower TCO (total cost of ownership) for those B2C applications and products.”
In this emerging AI era, Qiao said there are two problems developers face:
- Performing fast iterations of training using enterprise data.
- Scaling generative AI applications in production.
The company she co-founded, Fireworks AI, is “laser-focused” on handling these two problems for developers, she told The New Stack. “We offer extremely fast fine-tuning,” she added.
Fireworks AI leverages open source models. It recently raised $25 million in funding and claims 12,000 users, including Quora, Sourcegraph and the AI-powered presentation company Tome. It estimates it serves more than 25 billion tokens daily.
Latency Is Critical in AI Applications
At B2C companies like Meta, where she previously worked, Qiao learned that interactivity and low latency are absolute requirements. How quickly content is generated directly affects whether a product is viable, she said; creating a quality AI product requires using your own data and iterating on the model quickly, she added.
“All the developers at enterprises we talked to, they have their proprietary data, use our fine-tuning platform, and generate a customized model,” she said. “A one-click upload to our inference platform, and then your product can talk to your customized model directly using the content generated from your model.”
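In practice, that integration can be as thin as pointing an OpenAI-compatible client at the customized model. The following is a minimal sketch, not Fireworks’ documented API: the base URL and the fine-tuned model ID are illustrative assumptions.

```python
# Minimal sketch: a product talking to a customized model through an
# OpenAI-compatible inference endpoint. The base_url and model ID below
# are illustrative assumptions, not documented Fireworks values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    # Hypothetical ID for a model fine-tuned on proprietary data
    model="accounts/acme/models/support-assistant-ft",
    messages=[{"role": "user", "content": "Summarize this customer ticket: ..."}],
)
print(response.choices[0].message.content)
```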
Developers then must look at the product metrics, adjust the data if needed and keep the loop going to fine-tune the models.
Then, the AI application must be able to scale very quickly while delivering a low total cost of ownership, she added.
“If the cost is high, then you bleed money much faster, so it will be a disaster and you won’t have a viable business,” she said. “Both latency and TCO are important for B2C companies.”
The Cost Challenge of AI Apps
But even with a great product, generative AI applications can be more expensive to run than traditional applications, which factors into the total cost of ownership. One key difference is that generative AI applications run on GPUs rather than on heavily commoditized CPUs.
“GPUs are expensive — it’s not just the chips that are expensive. A GPU is very power-hungry. Power is expensive. Power produces heat. It cannot use air cooling, it has to use liquid cooling or [immersion] cooling where you dump the chips in oil [to] take away heat,” Qiao said. “So all the supporting infrastructure jacks up the whole infrastructure cost of GenAI.”
That cost can be an additional barrier to business viability, she added. Fireworks attempts to help companies address the TCO challenge by focusing on smaller open source models that are on par with or better than large proprietary offerings, while being more cost-effective to run.
Use Cases for Smaller Models
Many of Fireworks AI’s customers are using AI to create assistants, she said: medical assistants, legal assistants and coding assistants are popular use cases. The interactive, conversational nature of their output makes latency a particularly important challenge.
Documents are another use case she frequently sees. From images to PDFs, AI is being used to scan and search documents for product catalogs, e-commerce and even risk analysis. Tome, a customer of Fireworks AI, uses AI to build presentation slides for business users.
Without a fast response time, an AI application can become a horrible product, she added.
“That response time usually has to be half a second or one second,” she said. “It becomes a much more interesting product because it’s responsive, interactive.”
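For an interactive product, what matters most is when the first token arrives, not when the full response finishes. A quick way to check an application against that half-second budget is to time a streaming request; the sketch below reuses the same assumed endpoint and hypothetical model ID from the earlier example.

```python
# Minimal sketch: measuring time to first token on a streaming request,
# since perceived responsiveness depends on when output starts, not when
# it finishes. Endpoint and model ID are illustrative assumptions.
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

start = time.perf_counter()
stream = client.chat.completions.create(
    model="accounts/acme/models/support-assistant-ft",  # hypothetical model ID
    messages=[{"role": "user", "content": "Draft a two-line product update."}],
    stream=True,
)
for chunk in stream:
    # Skip empty chunks (e.g., the initial role-only delta) and report
    # the elapsed time as soon as real content arrives.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"Time to first token: {time.perf_counter() - start:.2f}s")
        break
```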