Rethinking open source generative AI: open washing and the EU AI Act

Open Source@lemmy.ml – 68 points – 4 months ago

dl.acm.org

Cross-posting to the OpenSource community as I think this topic will also be of interest here.

This is an analysis of how "open" different open source AI systems are. I am also posting the two figures from the paper that summarize this information below.

ABSTRACT

The past year has seen a steep rise in generative AI systems that claim to be open. But how open are they really? The question of what counts as open source in generative AI is poised to take on particular importance in light of the upcoming EU AI Act that regulates open source systems differently, creating an urgent need for practical openness assessment. Here we use an evidence-based framework that distinguishes 14 dimensions of openness, from training datasets to scientific and technical documentation and from licensing to access methods. Surveying over 45 generative AI systems (both text and text-to-image), we find that while the term open source is widely used, many models are ‘open weight’ at best and many providers seek to evade scientific, legal and regulatory scrutiny by withholding information on training and fine-tuning data. We argue that openness in generative AI is necessarily composite (consisting of multiple elements) and gradient (coming in degrees), and point out the risk of relying on single features like access or licensing to declare models open or not. Evidence-based openness assessment can help foster a generative AI landscape in which models can be effectively regulated, model providers can be held accountable, scientists can scrutinise generative AI, and end users can make informed decisions.

Figure 2: Openness of 40 text generators described as open, with OpenAI’s ChatGPT (bottom) as closed reference point. Every cell records a three-level openness judgement (✓ open, ∼ partial or ✗ closed). The table is sorted by cumulative openness, where ✓ is 1, ∼ is 0.5 and ✗ is 0 points. RL may refer to RLHF or other forms of fine-tuning aimed at fostering instruction-following behaviour. For the latest updates see: https://opening-up-chatgpt.github.io

Figure 3: Overview of 6 text-to-image systems described as open, with OpenAI's DALL-E as a reference point. Every cell records a three-level openness judgement (✓ open, ∼ partial or ✗ closed). The table is sorted by cumulative openness, where ✓ is 1, ∼ is 0.5 and ✗ is 0 points.
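For anyone wondering how the "cumulative openness" ordering in these tables works, here is a minimal Python sketch of the arithmetic the captions describe. Only the point values (✓ = 1, ∼ = 0.5, ✗ = 0) and the sort by total come from the captions. The system names, the judgements, the function names, and the use of 4 dimensions instead of the paper's 14 are all made up for illustration, and this is not the authors' own code.

```python
# Point values as stated in the figure captions.
POINTS = {"✓": 1.0, "∼": 0.5, "✗": 0.0}


def openness_score(judgements):
    """Sum the points for one system's per-dimension judgements."""
    return sum(POINTS[j] for j in judgements)


def rank_by_openness(systems):
    """Return (name, score) pairs sorted from most to least open."""
    scored = [(name, openness_score(js)) for name, js in systems.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical systems with invented judgements (not data from the paper).
example = {
    "model-a": ["✓", "✓", "∼", "✗"],
    "model-b": ["✓", "∼", "∼", "✗"],
    "closed-ref": ["✗", "✗", "✗", "∼"],
}

for name, score in rank_by_openness(example):
    print(f"{name}: {score}")
```

Two systems with the same total can differ a lot in which dimensions are open, which is why the paper argues against reducing openness to a single feature.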

There is also a related Nature news article: Not all ‘open source’ AI models are actually open: here’s a ranking

PDF Link: https://dl.acm.org/doi/pdf/10.1145/3630106.3659005

Thank you for bringing more awareness of this. I'm what you might call an "AI skeptic" and don't really care what happens in the AI space as long as it doesn't screw up things I care about.

But I care deeply about FOSS and AI is screwing it up. I don't want to have to explain why XYZ thing absolutely is not Open Source and that "Open Source" has a specific meaning beyond "you can look at (at least some of) the source code."

(Compare it to the term "hacker", which for a lot of muggles, at least, has taken on the exclusive meaning of committing some kind of fraud with computers. Originally it meant something very different. And it's unfortunate the world has forgotten the old meaning.)

Another project that is diluting the term "Open Source" is Grayjay, a video streaming app that is a FUTO project (and FUTO is a Louis Rossmann thing). Rossmann has called it Open Source in YouTube videos, but it's not Open Source. (The license is here and forbids things like "commercial use" (selling the software or derivative works) and removing facilities for paying the FUTO project from derivative works. That is a lot less restrictive than the license was last time I checked it; previously it didn't allow redistribution or derivative works at all. But it's not Open Source even now.)

I did not know of the term "open washing" before reading this article. Unfortunately it does seem like the pending EU legislation on AI has created a strong incentive for companies to do their best to dilute the term and benefit from the regulations.

There are some paragraphs in the article that illustrate the point nicely:

In 2024, the AI landscape will be shaken up by the EU's AI Act, the world's first comprehensive AI law, with a projected impact on science and society comparable to GDPR. Fostering open source driven innovation is one of the aims of this legislation. This means it will be putting legal weight on the term “open source”, creating only stronger incentives for lobbying operations driven by corporate interests to water down its definition.

[.....] Under the latest version of the Act, providers of AI models “under a free and open licence” are exempted from the requirement to “draw up and keep up-to-date the technical documentation of the model, including its training and testing process and the results of its evaluation, which shall contain, at a minimum, the elements set out in Annex IXa” (Article 52c:1a). Instead, they would face a much vaguer requirement to “draw up and make publicly available a sufficiently detailed summary about the content used for training of the general-purpose AI model according to a template provided by the AI Office” (Article 52c:1d).

If this exemption or one like it stays in place, it will have two important effects: (i) attaining open source status becomes highly attractive to any generative AI provider, as it provides a way to escape some of the most onerous requirements of technical documentation and the attendant scientific and legal scrutiny; (ii) an as-yet unspecified template (and the AI Office managing it) will become the focus of intense lobbying efforts from multiple stakeholders (e.g., [12]). Figuring out what constitutes a “sufficiently detailed summary” will literally become a million dollar question.

Thank you for pointing out Grayjay, I had not heard of it. I will look into it.

As long as they aren't putting ridiculous terms on model usage like SD3, and the weights are provided, I'm happy with it.

A bunch of these columns are outright absurd TBH, to the extent that I'm not sure the author really knows what FOSS is about. What's open API access even supposed to be - API access is closed by definition.

Also, there has never been a requirement that open source software needs to be documented - and for good reason - so I'm not a fan of the documentation column either.

and for good reason

I'd love to hear that reasoning. Personally, I will avoid using a FOSS product if the documentation is terrible or non-existent. Obviously I have grace for new or bleeding-edge projects. But I've avoided using some FOSS stalwarts simply because I don't have the time to dedicate to trial and error learning.

Because FOSS shouldn't add burdens. You publish your work and let everyone else use it. That shouldn't add extra obligations on you. Usually, you'd also write some docs - after all, without them nobody will know how to use your program, so why bother publishing - but it shouldn't be an obligation. Make it easy for people to open up their code without attaching strings.

Documentation is nice, but it's kind of a different thing than open source: a program can be open and undocumented, or closed but well documented - and I don't see why we'd want it different for models.

That's fair, thank you for explaining. I was going to say but forgot, this is assessing specifically for "openness" not 'open source-ness' though.

upcoming EU AI Act that regulates open source systems differently, creating an urgent need for practical openness assessment

So when they say "openness" they do put it in the context of open source rather than accessibility.