PITTSBURGH — It’s not just hype: generative AI is a different kind of technology.
How different? Just try to define what “open source AI” means.
That’s the task ahead of the Open Source Initiative (OSI), as it embarks on a road trip across continents to finalize a definition of “open source AI” that most of its stakeholders can live with.
PyCon US, held here in May in a city built on building things, marked the first stop on the road trip, which is being supported by the Alfred P. Sloan Foundation, Amazon, Cisco, and Google Open Source.
This month it’s on to Paris (OW2) and Madrid (OpenExpo Europe). The goal: have the definition wrapped up by the All Things Open conference in North Carolina in late October.
After two years of work, OSI has a draft definition, Stefano Maffulli, executive director of OSI, told The New Stack. The team is going through a “validation phase,” he said, making sure the definition includes everything that falls into the open source category, or is likely to. And it’s working on a FAQ.
Attendees at the PyCon workshop — roughly a dozen — were asked to help the team brainstorm questions and answers for the FAQ, making sure everything was being covered.
It’s the culmination of a project — and a larger mission for OSI — that Maffulli envisioned three years ago, when he was interviewing for the executive director role. He knew that AI would be the next big thing in tech.
OSI “needs to build its future on driving difficult conversations,” he said. “That’s what we want to do. And it was part of the mission already when I joined and I really took it as, the mission is as a convener of conversation. The Open Source Initiative lays the foundation for the open source ecosystem.
“This is the hardest conversation we can have right now. It needs to happen right now.”
Why AI Is Unprecedented in Open Source
But the very nature of AI, and of what open source has traditionally dealt with, has made the whole process challenging.
Maffulli offered historical context: “The open source world had it easy, because software and computer science have evolved over decades together with the concept of open source and free software. The evolution started at the beginning of the ‘80s. Computers were becoming more popular, software was being developed. More developers and more users of computer software were appearing and using it, and all of this was almost naturally evolving together.”
Previously, he said, the concept was simple: there’s source code, there’s binary code. Two representations of the same artifact. And for years, regulators paid little attention as the open source ecosystem grew.
“We have regulators freaking out, all around the world because these things are capable of doing things that computer scientists themselves say, ‘We don’t know why, we didn’t know how, we can’t really fix them. But trust us, it’s gonna be fine.’ And regulators are like: You are freaking us out.”
—Stefano Maffulli, executive director, Open Source Initiative
“Now, all of a sudden AI comes in,” Maffulli said. “Especially this new generation of AI in the past three, four, five years, comes in, creates new artifacts. Now, the model weights and parameters are a brand new thing. They are functional, they change the status of systems. But they’re not software. They’re not source code. They’re not data, either. So they’re a new artifact.”
He added, “Then, the other thing: there are billions of people already using them.”
And in contrast to the earlier history of open source software, Maffulli said, “We have regulators freaking out, all around the world because these things are capable of doing things that computer scientists themselves say, ‘We don’t know why, we didn’t know how, we can’t really fix them. But trust us, it’s gonna be fine.’ And regulators are like: You are freaking us out.”
Biggest Issues: Data and Certification
The PyCon workshops to build out the FAQ were lively, said Mer Joyce, founder of Do Big Good, a design consultancy, who is helping OSI with the project.
“We had people writing on Post-Its and clustering and we came up with these different areas of questions,” she said.
At this stage, two issues have emerged as sticking points, according to her and Maffulli.
One is certification, or “how we’re gonna be running this analysis and actually certify that the system is an open source AI or not,” Maffulli said.
The other issue is what exactly constitutes data in an AI environment. And that is, to put it mildly, tricky.
“The wording in the draft right now is vague on purpose,” Maffulli said. “It is using terms that the legal community understands.”
Currently, the draft defines “data information” used in open source AI like so:
Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data.
For example, if used, this would include the training methodologies and techniques, the training data sets used, information about the provenance of those data sets, their scope and characteristics, how the data was obtained and selected, the labeling procedures and data cleaning methodologies.
“There are a few keywords in here that need to be highlighted,” Maffulli said. “One is ‘sufficiently detailed information.’ What does that mean? We need to build into the FAQ by giving examples, looking at the systems out there and saying this is sufficiently detailed.”
Another key phrase: “skilled person.” “It’s not, like, everyone,” he added.
And also, “substantially equivalent.” Which doesn’t mean, he said, “vaguely resembles.”
“In the FAQ, the reason for this, we will have to explain it, we have to be clear. Because the people in the team were asking, ‘What does it mean, exactly?’ Well, let’s build examples as we go.”
In the definition, “the same or similar data” could refer to, for instance, synthetic data.
“In the cases where it’s necessary,” Maffulli elaborated, “it could be, let’s say, I cannot access data because I don’t have the right to distribute it to you. It’s copyrighted. Or, or it’s my secret [data], proprietary. I cannot tell you what it is. But I tell you enough, where I give you a sample, [you] can go and rebuild, with instructions on how to go on and rebuild something similar.”
In the PyCon workshop, he noted, a participant asked what happens if the source of the data is the data set from Reddit. “So that’s a few million dollars. Do you have a few million dollars in your pockets to license it? I don’t.”
But is there something that would constitute “the same or similar data,” from which someone could build a “substantially equivalent system”? Maffulli summed up, “That’s the question that we will have to ask.”
Next Stops: Paris and Madrid
And so the OSI road show makes its next stops this week in Paris and Madrid. The European Union is ahead of the rest of the world in setting out AI governance policies, as it implements an AI Act established in December.
Maffulli said he’s proud of the two-year effort so far to gather input across countries, backgrounds, expertise and interests.
“Why we got started was that there was no shared understanding of this environment,” Maffulli said. “And so all of us from academia, industry, researchers, developers, civil society, lawyers, collectively, we needed to have a very, very thoughtful, difficult conversation and bring knowledge up for all of us together.”
“We’re getting to the stage where, honestly, I’m really impressed by the amount of people that we’ve touched with this process.”
The post Open Source AI: OSI Wrestles With a Definition appeared first on The New Stack.