Every IT company today, it seems, is trying to incorporate generative AI into its own apps, ideally in a way that brings value to the user and loyalty to the vendor.
Social networking service LinkedIn recently incorporated GenAI into two of its services and shared its learnings last week in a blog post.
The post draws on the company’s experience building two “premium” LinkedIn services: one for summarizing text in a post, the other for recommending job posts.
“The good thing about generative AI is that it is really democratized and lowered the bar for AI development. You can get an initial prototype out really, really fast,” said Karthik Ramgopal, Distinguished Engineer at LinkedIn. “But getting the quality to an acceptable level, with a large scale of inputs and outputs, takes a very, very long time.”
This leads us to our first lesson in GenAI design…
#1 Don’t Chart Future Progress on Early Momentum (80/20 Rule)
After charting out a roadmap, LinkedIn’s development team was pleased to find that it had completed 80% of the basic design within the first month. Surely, the app would be ready forthwith.
That turned out to be very much not the case. In fact, it took an additional four months to reach 95% completion.
For one, hallucinations kept plaguing the system, despite considerable efforts to curtail them. Other quality issues persisted as well.
The early progress “creates unattainable expectations,” the researchers wrote in the post. “The initial pace created a false sense of ‘almost there,’ which became discouraging as the rate of improvement slowed significantly for each subsequent 1% gain.”
In other words, for project teams, an AI project might have an extra-treacherous Valley of Death to traverse.
“You could be fooled by the deceptively fast initial progress,” Ramgopal said. “But battle-testing it with a large variety of inputs, controlling the hallucinations, making sure the [output] is factual, making sure the voice and the tone are in line with what you want it to be, ensuring that responses don’t take forever, and the latency is acceptable — all these things take a very long time to get right.”
These models have a “mind of their own,” Ramgopal said. Fixing one issue may trigger other issues in other parts of the application.
#2 RAG Makes the LLMs Work
LinkedIn’s parent company is Microsoft, which also owns a significant stake in OpenAI, so it has access to one of the best LLMs available. Yet an LLM on its own can’t answer all the questions, nor does it have access to LinkedIn’s rich troves of user data. And retraining the LLM, or building one anew, would be prohibitively expensive, even for LinkedIn.
So the company built a Retrieval-Augmented Generation (RAG) pipeline, which, in the course of answering a question, can call internal APIs and even external sources such as Bing, then inject the responses back into the LLM’s context. In this approach, the LLM taps the external resource through a function call. For an article to be summarized, for instance, the LLM must read the article first, then apply its own knowledge base to interpret it.
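The post doesn’t include implementation details, but the overall shape of such a pipeline is roughly as follows. In this minimal sketch, `llm` and `retriever` are hypothetical stand-ins for a model client and an internal or Bing-backed search service, not LinkedIn’s actual systems:

```python
# Minimal RAG sketch; `llm` and `retriever` are hypothetical stand-ins,
# not LinkedIn's actual services.
def answer_with_rag(question: str, llm, retriever) -> str:
    # 1. Fetch supporting material from internal APIs or the web.
    documents = retriever.search(question)

    # 2. Inject the retrieved text into the prompt so the model grounds
    #    its answer in real data rather than guessing.
    context = "\n\n".join(doc.text for doc in documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. The LLM applies its own knowledge to interpret the material.
    return llm.complete(prompt)
```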
There is additional work to be done to prepare external data for the RAG pipeline, however. All LLMs have a limit on how much contextual information they can ingest, so filtering, and even fine-tuning, of the additional data still needs to be done.
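Filtering retrieved material down to the model’s context budget can be as simple as the following sketch; the word-count token estimate is a crude placeholder, where a real system would use the model’s tokenizer:

```python
# Hypothetical helper: keep only as many retrieved passages as fit the
# model's context budget, assuming passages arrive ranked by relevance.
def fit_to_context(passages: list[str], max_tokens: int) -> list[str]:
    selected, used = [], 0
    for passage in passages:
        cost = len(passage.split())  # crude token estimate for this sketch
        if used + cost > max_tokens:
            break                    # everything past this point is dropped
        selected.append(passage)
        used += cost
    return selected
```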
#3 Describe Your APIs
Outside their typical diet of JSON, large language models (LLMs) are pretty dumb creatures, unable to navigate the world around them.
LinkedIn has a wealth of unique information about its users: their skills, the training materials available to them, and so on. While much of that data can be called up programmatically easily enough via RPC APIs, it is not easy for LLMs to use.
“A lot of the APIs which are designed right now, internal or external, aren’t very LLM-friendly; they are designed for human engineers to call via code,” Ramgopal said. This is where many of the hallucinations come in: a byproduct of the model not understanding how to work the API to get accurate information.
So, LinkedIn embarked on a project to wrap a schema describing available “skills” around these APIs to help LLMs use them. The OpenAPI standard, for instance, offers a way for APIs to describe themselves.
Each skill carries a human-readable description of what the API does, along with the configuration needed to call it.
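LinkedIn hasn’t published its exact schema, but a skill descriptor in the widely used function-calling style might look something like this; the `get_member_skills` API here is hypothetical:

```python
# Hypothetical "skill" descriptor wrapping an internal API so an LLM can
# discover and call it. Field names follow the common function-calling
# convention, not LinkedIn's published schema.
get_member_skills = {
    "name": "get_member_skills",
    "description": (
        "Return the professional skills listed on a member's profile, "
        "given their member ID."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "member_id": {
                "type": "string",
                "description": "Unique identifier of the member.",
            }
        },
        "required": ["member_id"],
    },
}
```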
#4 Don’t Split the Elephant Into Too Many Pieces
Working in parallel is good — within limits.
Because this was one of the first GenAI projects at LinkedIn, the team did not have a lot of pre-existing resources to fall back on. Almost everything short of the LLM itself had to be built from scratch.
Within LinkedIn, different agents were built by different teams, each with their own area of specialization. This sped the development process, though it came with the cost of fragmentation.
“Maintaining a uniform user experience became challenging when subsequent interactions with an assistant might be managed by varied models, prompts, or tools,” the post noted.
To smooth the user experience, LinkedIn created a small ‘horizontal’ engineering pod to build the components common to all the features being built, including testing tools, prompts and shared UX components.
#5 Evaluation Will Be a Challenge
GenAI programs are different from regular applications. Judging their success requires a new type of evaluation, one that can’t be easily automated.
“In order to evaluate, you need to have an objective set of guidelines. Otherwise, you’re going to get scores all over the place,” Ramgopal said. “You are essentially evaluating a subjective, non-deterministic product, so it’s hard to come up with an objective set of guidelines.”
For instance, just returning a correct answer to a question is no longer sufficient. The voice and tone of how that answer is delivered must also be considered. A person asking if they are fit for a particular job would consider the service rude if its reply was “You are a terrible fit” (even if it were accurate).
Fortunately, LinkedIn had an internal linguist team, which created tooling and processes to build metrics around hallucination rate, responsible AI violations, coherence, style and other factors.
The group is currently in the process of automating the evaluation pipeline.
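Even a single dimension like hallucination needs a concrete, repeatable score before evaluation can be automated. As a heavily simplified illustration, under the assumption that grounded answers reuse vocabulary from their sources (this heuristic is a placeholder, not LinkedIn’s tooling):

```python
# Crude groundedness score: the fraction of response sentences that
# share vocabulary with the source documents. A placeholder proxy for
# "not hallucinated," not LinkedIn's actual metric.
def groundedness(response: str, sources: list[str]) -> float:
    source_words = {w.lower() for s in sources for w in s.split()}
    sentences = [s for s in response.split(".") if s.strip()]
    if not sentences:
        return 0.0
    grounded = sum(
        1 for s in sentences if set(s.lower().split()) & source_words
    )
    return grounded / len(sentences)
```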
Bonus Learning: GenAI Is All About Latency vs. Accuracy
Building generative AI applications is all about the trade-off between latency and accuracy.
“It is a game you have to play very carefully,” Ramgopal said. “You have to be very cautious about how much work you ask the LLM to do in a prompt.”
For instance, one way to curb LLM hallucinations is chain-of-thought prompting, where you ask the LLM to spell out the reasoning steps it took to arrive at the answer. The downside, however, is that it extends the time it takes to deliver the answer to the user.
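The trade-off is easy to see side by side. A hypothetical example of both prompt styles for a job-fit feature:

```python
# Direct prompt vs. chain-of-thought prompt (hypothetical examples).
# The extra reasoning tokens improve reliability, but they must all be
# generated before the final answer, which is where the latency goes.
direct_prompt = "Is this candidate a fit for the role? Answer yes or no."

cot_prompt = (
    "Is this candidate a fit for the role?\n"
    "First, list the role's key requirements.\n"
    "Then, check the candidate's profile against each requirement.\n"
    "Finally, state your conclusion with a short justification."
)
```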
There are ways around this, of course. One is streaming the answer, delivering parts of it to users as they are generated, so they don’t have to wait for the whole thing. But major decisions around latency vs. accuracy will have to be made in the architecture design phase of development.
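Streaming doesn’t reduce total generation time, but it cuts the time to first visible output. Using the OpenAI Python SDK as one concrete example (the model name and prompt are illustrative):

```python
# Stream tokens to the user as they arrive; perceived latency drops
# even though total generation time is unchanged.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this post: ..."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # render each fragment immediately
```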
“You will be tempted into using an LLM everywhere. But you should be very cautious about where to use it and when to use it,” Ramgopal said. “LLM is like a bulldozer. You don’t want to use it if you want to simply knock a nail off. Something may be a lot cheaper, a lot more effective and a lot easier to use, like a traditional model or even a rules engine of business logic.”
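In practice, that often means putting a cheap deterministic router in front of the model. A hypothetical sketch:

```python
# Hypothetical router: handle trivial cases with plain rules and reserve
# the LLM (the "bulldozer") for requests that actually need it.
def handle_request(query: str, llm) -> str:
    q = query.strip().lower()
    if q in {"hi", "hello", "help"}:
        return "Hi! Ask me about jobs or posts."  # rules-engine path
    if len(q.split()) < 3:
        return "Could you tell me a bit more?"    # still no LLM needed
    return llm.complete(query)                    # bulldozer path
```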
LinkedIn has found that prototyping a generative AI-based feature can be done really quickly. Getting it into production, however, is another matter entirely.