

What does “open” mean in the context of AI? Must we accept hidden layers? Do copyrights and patents still hold sway? And do consumers have the right to opt out of data collection? These are the types of questions that the folks at the Open Source Initiative are trying to get to the bottom of, as part of a deep dive to define “open source AI.”
The rules around what could be considered open source in tech used to be fairly well-defined, according to Stefano Maffulli, the executive director of the Open Source Initiative. Back in the 1970s, it was generally accepted that only things generated by a human could be legally protected with a copyright or a patent. Stuff generated by a machine, such as binary code, generally could not be protected.
That began to change with the PC revolution in the 1980s and Microsoft’s massive success selling software. Following several policy changes and landmark lawsuits, people began seeking and gaining protection for things such as source code and machine-generated binary code, Maffulli says.
With the advent of massive generative AI models that are trained on public data scraped from the Internet, we find ourselves at the edge of what current copyright law can cover. In fact, according to Maffulli, we’ve likely already passed that point, and now find ourselves in dire need of new ideas and new frameworks to define what can and should be protected, and what can and should be open and accessible to all.
“When [GitHub] CoPilot was announced [in October 2021], it suddenly dawned that there were new copyright issues appearing on the horizon,” Maffulli tells Datanami in a recent interview. “Then I started diving a little bit deeper into how AI [works], how machine learning, deep learning, neural networks work, and it dawned on me again that there were new artifacts, new things. And we were really at the dawn of a new era where we need new laws, we need new frameworks to understand what’s happening. And we need to do that very quickly.”
OSI ‘Deep Dive’

With its “Defining Open Source Deep Dive” program, the OSI is taking a disciplined, multi-pronged approach to understanding every aspect of the question of openness in AI.
It set the process in motion earlier this year with a 20-page report on AI openness in February. In early June, it posted a public call for papers and research on the topic, followed by a set of kickoff meetings in San Francisco later that month. There were two community review workshops in July, in Oregon and Switzerland, followed by a third workshop last week in Spain.
If all goes according to schedule, OSI hopes to submit the first release candidate of its new definition of open source for AI next month. The process will continue into 2024, according to the group’s website.
The group is trying to remain open to all perspectives in coming up with its definition and policy recommendations. “It largely depends on what people want to do,” Maffulli says. “At the Open Source Initiative, we’re just driving this conversation. We’re not really forcing our opinions on anyone.”
A New Age of Data
The radical openness that defined the first 40 years of the Internet served the community well and sowed the seeds of technological progress to come. The egalitarianism of the Internet’s first phase of development fostered a community that thrived on openness and an ethos of sharing.
That started to change with the dawn of the big data age and the advent of social media and smart phones. Tech firms realized they could scrape the Internet for data freely shared by users, as well as some data not freely shared but still available (such as books), to amass huge data sets. Those data sets are now being used to train massive generative AI models that have the potential to not only reshape consumers’ relationship with technology for years to come, but also separate winners from losers on the corporate and creative battlefields.
One of the big questions that OSI is struggling with is: Does current copyright law still work in the age of AI? The answer hasn’t been determined yet, but it doesn’t look like it does.
“I think we’re at the point where we should make a decision whether we want those to be covered by copyright or whether we need to create new rights and new obligations for society,” Maffulli says. “What’s the best approach?”
There are different perspectives on these questions, and each deserves to be considered. The debate touches on several aspects of intellectual property rights, including copyrights, patents, trademarks, and trade secrets. But it’s also tied up with privacy rights, security obligations, and labor law, which adds to the complexity.
Maffulli says he understands the plight of a creative worker whose past work can be harnessed to train a GenAI model that can re-create that worker’s output, potentially putting him out of work. Is there any legal recourse for him? Should he be granted legal protections? It’s tempting, he says.
“The reaction to that is to say, wait a second, you have been feeding my images, my text, into this machine and now this machine is capable of replacing me? No!” he says. “I have copyright rights on the work that I have produced. I never authorized anyone to use the archive of my work as a data mining source. Therefore, I want you to ask me for permission. I think that that’s a very fair approach, a very fair reaction.”
However, if communities and governments opt to stiffen data protections, it will naturally become harder to obtain data to train AI models. That will not only slow down the overall rate of AI innovation, but it will likely also have the side effect of entrenching the already dominant positions that OpenAI, Google, and Meta enjoy in the space, he says.
“I think the biggest threat is there will not be the possibility to have a diverse amount of players in the field,” he says. “This is a field that naturally, at every step, favors the ones with the big resources, large amounts of resources. Because the main three components are data, knowledge, and hardware.”
The tech giants already have the data, which they have been systematically scraping from the Internet for years. They have the financial resources to afford the giant GPU clusters needed to train AI models. And they naturally attract the top minds in the field as a byproduct of having massive GPU clusters and lots of data to play with.
Maffulli sounds pragmatic about the potential to enact meaningful change by strengthening copyright protections. The tech giants already have the means to bury lawsuits brought by individuals, he says. And besides, they already have all the data. In many cases, they acquired it fair and square, thanks to consumers’ tendency to click “yes” on every privacy policy dialog box they’re presented with.
‘Cat’s Out of the Bag’
For years, Maffulli shared his image and title liberally across the Web. Then at one point, he tried to rein it back in by deleting his image from every major site. It’s his likeness and his right, he figured. He would force the tech giants to forget they ever saw him, he thought. Eventually, he realized it was likely impossible.
That experience has informed his view on what is possible to be done with data and the open future of AI. “I think it’s better off if we just let it go,” Maffulli says. “The cat is out of the bag.”
In other words, instead of trying to put the cats back in the bag, we are better off just managing the loose cats as best we can. That means stronger operational controls on data that’s already out in the open, and better guardrails to guide those cats to happy homes.
“I do think that it cannot be solved by copyright law,” Maffulli says. “It needs to be solved by having strong policy, privacy protection laws, strong control from the individual to say ‘I don’t want to be recognized. Therefore, even if you have my face in the database, it gets deactivated. You cannot use it.’”
There are plusses and minuses to open source and to copyright protections, and they must be weighed carefully. OSI’s policy is not to judge how practitioners use open source software, noting that it’s impossible to draw a line between moral and immoral uses. As the debate plays out over what open means in AI, that line is murkier than ever.