
NEW YORK — At the United Nations OSPOs for Good Conference, we were once more reminded of the curious situation of AI and open source programs: While the foundations of AI are built on open source tools and libraries, almost no major AI program is truly open source. OpenAI’s ChatGPT, Google’s PaLM (and its successor, the multimodal Gemini), and Meta’s Llama-3 are often touted as open, but they’re not. They come with significant restrictions that don’t meet the definition of open source software.
Enter the Open Source Initiative (OSI), the stewards of the Open Source Definition. Recognizing the growing importance of AI and the need for clarity in this space, the OSI has embarked on an ambitious project to define what “open source AI” should mean. This effort brings together 70 experts, including researchers, lawyers, policymakers and representatives from tech giants like Amazon, Google and Meta.
That’s easier said than done. As Stefano Maffulli, OSI’s executive director, noted in a panel on open source and AI, “While there’s broad agreement on the overarching principles, it’s becoming obvious that the devil is in the details.”
The open source community is a big tent, encompassing everyone from basement hackers to grassroots activists to Fortune 500 companies, each with their own priorities and concerns.
In short, “we need to have new guardrails and new guidelines when it comes to what open source AI actually means,” said Ashley Kramer, GitLab’s chief marketing and strategy officer, during the panel discussion.
LLM Data Transparency: a Thorny Issue
It became clear from the panel’s discussion that the biggest challenge in defining open source AI lies in addressing the role of training data. Large language models (LLMs) rely on vast data sets, often scraped from the internet without explicit permission. This messy data raises thorny questions about privacy, copyright and ethics.
Indeed, we know some of this data is flatly illegal. “One of the largest data sets of images [LAION-5B] that is being used for training a lot of the image generation AI tools recently has contained child sexual abuse images,” Maffulli said. “We need data set maintainers to notice and remove those things.”
The OSI’s draft definition attempts to sidestep the data issues by focusing on the “four freedoms” traditionally associated with open source software: the freedom to use, study, modify and distribute the AI system. It focuses on the code and not the data.
Should an open source AI model be required to disclose its training data? If so, how can this be reconciled with privacy concerns and the practical challenges of sharing petabytes of information? For many critics of the OSI’s draft AI definition, the answer to the first question is not just yes but hell yes.
As Tom Callaway, principal open source technical strategist at Amazon Web Services, wrote before the conference on LinkedIn, “You cannot build an LLM without data. Without the data, the LLM doesn’t just lack any purpose; it doesn’t exist. That makes the data a functional and required source component of an LLM.”
He and others argue that any definition of open source AI will be incomplete without addressing the data issue.
Maffulli acknowledged that this is a real concern: “This needs to be debated and finalized.” But, he added, “pushing for radical openness for data has drawbacks and brings issues. So it’s going to be a balance of intentions and what’s going to be the best outcome for the general public.”
However, another panelist, Sasha Luccioni, AI and climate lead at Hugging Face, sees it another way. Luccioni believes being an open source purist is a mistake.
“You can’t really expect all companies to be 100% open source as it’s defined by the open source license,” she said during the panel. “That’s why there is a multitude of licenses. Saying that this is not true open source can antagonize companies. You can’t expect companies to just give up everything that they’re making money off of and do so in a way that they’re comfortable with.”
She believes that “there’s a responsible AI license that can exist” — one that is open source friendly — “where you can kind of define your terms of open source. By tweaking the language a little bit, you can build forward in a way that companies, governments and academia are all comfortable with instead of saying this project or license is not open source.”
‘We Have To Do It Together’
None of the open source advocates at the conference that The New Stack spoke with was pleased with this take. However the OSI AI definition works out, the issue of what is — and isn’t — open source AI remains critical to the open source community.
It’s also important outside the open source community. As Ambassador Philip Thigo, special envoy on technology for Kenya, observed in a keynote address at the conference devoted to open source and AI, “Open source AI ensures that many Global South communities can build their own AI programs and LLMs.”
These countries can’t afford to pay an OpenAI for their AI needs. They need open source, global standards and interoperability to build AI systems to address their health, climate and education needs.
Looking ahead, “we have to do it together,” Kramer said on the conference panel, indicating that open source is the way to do it.
“We must understand the data that was foundational for the model,” Kramer said. “While I love the hype around AI and I love the direction it’s going, we saw very similar patterns with the Internet and the rise of cloud technology. The faster we move, the more things we miss. So it takes a group and it takes an open source AI guardrail model to really figure out how to get there fast with privacy, trust and security at top of mind.”
Stay tuned. We’re still writing the story of open source AI. As the OSI and others grapple with these complex issues, the outcome will have profound implications for the future of AI development, innovation and governance. The challenge lies in finding a definition that preserves the spirit of openness while addressing the unique challenges posed by data. This task may require rethinking some long-held assumptions about what it means to be “open source” in the age of AI.
The post Open Source AI: What About Data Transparency? appeared first on The New Stack.
AI uses both code and data, and this combination continues to be a challenge for open source, said experts at the United Nations OSPOs for Good Conference.