2 authors say OpenAI 'ingested' their books to train ChatGPT. Now they're suing, and a 'wave' of similar court cases may follow.

Technology@lemmy.world – 491 points – 1 year ago


Two authors sued OpenAI, accusing the company of violating copyright law. They say OpenAI used their work to train ChatGPT without their consent.



That's incorrect. Sure, it has no comprehension of what the words it generates actually mean, but it does understand the patterns that can be found in the words. Ask an AI to talk like a pirate, and suddenly it knows how to transform words to sound pirate-like. It can also combine data from different texts about similar topics to generate new responses that never existed in the first place.

Your analogy is a little flawed too. If you mixed all the elements in a transformative way and didn't re-use any materials as-is, then even if you called it Mazefecootviltale, as long as the original material was transformed sufficiently, you haven't infringed on anything. LLMs don't get trained to recreate existing works (which would make them only capable of producing infringing works), but to predict the best next word (or even parts of a word) based on the input information. It's definitely possible to guide an AI towards specific source materials based on keywords that only exist in the source material, which could be infringing, but in general what it generates is so generalized that it's inherently transformative.

Again, that's not comprehension, that's mixing in yet more data that was put into the model. If you ask an AI to do something that is outside of the dataset it was trained on, it will massively miss the mark. At best, it will produce something that is close to what you asked, but not quite right. It's why an AI model that could beat the world's best Go players was beaten by a simple strategy that even amateur Go players could catch and defeat--the AI never came across that strategy while it was training against itself, so it had no idea what was going on.

And fair use isn't the bulletproof defense you think it is. Countless fan games have been shut down over the decades, most of them far more transformative than my hypothetical example, such as AM2R. You bet your ass that if I tried to profit off of that hypothetical crossover roguelike, using sprites, models, and textures directly ripped from their respective games, it would be shut down immediately.

EDIT: I also want to address the assertion that AI isn't trained to recreate existing works; in my view, that's wholly irrelevant. If I made a program that took all the Harry Potter books, ran each word through a thesaurus, and sold it for profit, that would still be infringing, even if no meaningful words were identical to the original source material. Granted, if I curated the output and made a few of the more humorous excerpts available for free through a Mastodon or Lemmy post, that would likely qualify as fair use. However, that would be because a human mind is parsing the output and filtering out the 99% of meaningless gibberish that a thesaurus-ized Harry Potter would result in.

The only human input to an AI that gave consent to being part of its output is the minuscule input of the prompt given to it by the human, which does not meet the minimal effort required for copyright protection under law. The rest of the input--the countless terabytes of data scraped from the internet and fed into the AI's training model--was all taken without the authors' consent, and their contribution vastly outweighs that of the prompt author and OpenAI's own transformative efforts via the LLM.

You seem to misunderstand what an LLM does. It doesn't generate "right" text. It generates "probable" text. There's no right or wrong since it only generates a single word ahead of where it currently is. That's why it can generate information that's complete bullshit. I don't know the details about this Go AI you're talking about, but it's pretty safe to say it's not an LLM and doesn't use a similar technique, as Go is a game and not a creative work. There are many techniques for creating algorithms that fall under the "AI" umbrella.
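
To make the "probable, not right" point concrete, here is a minimal sketch in Python. It is a toy bigram counter, not how ChatGPT or any real LLM is implemented (those use neural networks over sub-word tokens), and the function names and the tiny corpus are made up purely for illustration. The only thing it shares with an LLM is the basic loop: look at what came before, pick a likely next word, repeat.

```python
import random
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def next_word(counts, word, rng):
    """Sample one next word in proportion to how often it followed `word`."""
    followers = counts.get(word)
    if not followers:
        return None  # word never had a follower in training: nothing to sample
    candidates = list(followers)
    weights = [followers[c] for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def generate(counts, start, length, seed=0):
    """Build a sentence one word at a time, each step seeing only the last word."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        word = next_word(counts, out[-1], rng)
        if word is None:
            break
        out.append(word)
    return " ".join(out)


# Hypothetical toy corpus, standing in for the training data.
corpus = "the cat sat on the mat and the cat ate the fish"
model = train_bigram(corpus)
print(generate(model, "the", 6))
```

Nothing in this loop checks whether the sentence is true or sensible; it only follows the counts. It also returns nothing at all for a word it never saw during training.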

Your second point is a whole different topic. I was referring to a "derivative work", which is not the same as "fair use". Derivative works are quite literally everywhere. https://en.wikipedia.org/wiki/Derivative_work A derivative work doesn't require fair use, as it no longer falls under the same copyright as the original, whereas fair use is an exception under which copyrightable work can be used without infringing.

And also, most of the time those projects do not get shut down because they are actually illegal; they get shut down because companies with tons of money have teams of high-quality lawyers who can send threatening letters all day. A cease and desist isn't a legal enforcement from a judge, it's a "stop, and we won't (attempt to) sue you" letter. And that works on most small projects. These things very rarely go to court. And sometimes the shutdown is totally warranted. Especially for fan projects, it's extremely hard to completely erase all protected copyrightable work, since they are specifically made to at least imitate or expand upon what they're a fan project of.

EDIT: Minor clarification

Also, it should be mentioned that pretty much all games are in some form derivative works. Let's take Undertale, since I'm most familiar with it. It's well known that Undertale takes a lot of elements from other games: RPG mechanics from Mother and Earthbound, bullet hell mechanics from games like Touhou Project, and more from games like Yume Nikki, Moon: Remix RPG Adventure, and Cave Story. And funnily enough, the creator has even cited Mario & Luigi as a potential inspiration.

So why was it allowed to exist without being struck down? Because it fits the definition of a derivative work to the letter. You can find individual elements which are taken almost directly from other games, but it doesn't try to be the same as what it was created after.

Undertale was allowed to exist because none of the elements it took inspiration from were eligible for copyright protection. Everything that could have qualified for copyright protection--the dialogue, plot, graphical assets, music, source code--were either manually reproduced directly by Toby Fox and Temmie Chang, or used under permissive licenses that allowed reproduction (e.g. the GameMaker Studio engine). Meanwhile, the vast majority of content OpenAI used to feed its AI models were not produced by OpenAI directly, nor were they obtained under permissive license.

So... thanks for proving my point?

Meanwhile, the vast majority of content OpenAI used to feed its AI models were not produced by OpenAI directly, nor were they obtained under permissive license.

That's input, not output, so not relevant to copyright law. If your arguments focused on the times that ChatGPT reproduced copyrighted works, then we could talk about some kind of ContentID system for preventing that before it happens, or compensating the creators if it does. I think we can all acknowledge that it feels iffy that these models are trained on copyrighted works, but this is a brand new technology. There's almost certainly a win-win outcome here.

The AI models (not specifically OpenAI's models) do not contain the original material they were trained on. Just as the creators of Undertale took in the games they were inspired by and learned from them, the AI learned from the material it was trained on how to make similar yet distinctly different output. You do not need a permissive license to learn from something once it has been publicized.

You can't just put your artwork up on a wall, allow people to look at it, and then demand that nobody who looks at it learn from it because you have a license that says learning from it is not allowed. That's insane, which is why (as far as I know) no legal system acknowledges that as a legal defense.

"Right" and "probable" text is a distinction without a difference. The simple fact is that an AI is incapable of handling anything outside its learning dataset. If you ask an AI to talk like a pirate, and it hasn't had any pirate speak fed to it by a human via its training dataset, it will utterly fail. If I ask an AI to produce a PowerShell script, and it hasn't had code fed to it by a human via its training dataset, it will fail utterly. An AI cannot proactively buy a copy of Learn PowerShell in a Month of Lunches and teach itself how to use PowerShell. Overcoming that fundamental shortcoming--the inability to self-improve, to proactively teach itself and apply that new knowledge to existing concepts--is a crucial, necessary element of the transformative effort required to produce a derivative work (or fair use).

When that happens, maybe I'll buy that AI is anything more than the single biggest copyright infringement scheme the world has ever seen. Until then, though, I will wholeheartedly support the efforts of creative minds to defend their intellectual property rights against this act of blatant theft by tech companies profiting off their work.

You realize LLMs are designed not to self improve by design right? It's totally possible and has been tried - It's just that they usually don't end up very well once they do. And LLMs do learn new things, they're just called new models. Because it takes time and resources to retrain LLMs with new information in mind. It's up to the human guiding the AI to guide it towards something that isn't copyright infringement. AIs don't just generate things on their own without being prompted to by a human.

You're asking for a general intelligence AI, which would most likely be comprised of different specialized AIs working together, similar to our brains having specific regions dedicated to specific tasks. This just doesn't exist yet, but one of its parts now does.

Also, you say "right" and "probable" are without difference, yet once again bring something into the conversation which can only be "right". Code. You cannot create code that is incorrect or it will not work. Text and creative works cannot be wrong. They can only be judged by opinions, not by rule books which say "it works" or "it doesn't".

The last line is just a bit strange honestly. The biggest users of AI are creative minds, and it's why it's important that AI models remain open source so all creative minds can use them.

You realize LLMs are designed not to self improve by design right? It’s totally possible and has been tried - It’s just that they usually don’t end up very well once they do.

Tay is yet another example of AI lacking comprehension and intelligence; it produced racist and antisemitic content because it had no comprehension of ethics or morality, and so it just responded to the input given to it. It's a display of "intelligence" on the same level as a slime mold seeking out the biggest nearby source of food--the input Tay received was largely racist/antisemitic, so its output became racist/antisemitic.

And LLMs do learn new things, they’re just called new models. Because it takes time and resources to retrain LLMs with new information in mind. It’s up to the human guiding the AI to guide it towards something that isn’t copyright infringement.

And the way that humans do that is by not using copyrighted material for its training dataset. Using copyrighted material to produce an AI model is infringing on the rights of the people who created the material (the vast majority of whom are small-time authors and artists, and open-source projects composed of individuals contributing their time and effort to said projects). Full stop.

Also, you say “right” and “probable” are without difference, yet once again bring something into the conversation which can only be “right”. Code. You cannot create code that is incorrect or it will not work. Text and creative works cannot be wrong. They can only be judged by opinions, not by rule books which say “it works” or “it doesn’t”.

Then why does ChatGPT invent PowerShell cmdlets out of whole cloth that don't exist, yet accomplish the exact task that the prompter asked it to do?

The last line is just a bit strange honestly. The biggest users of AI are creative minds, and it’s why it’s important that AI models remain open source so all creative minds can use them.

The biggest users of AI are techbros who think that spending half an hour crafting a prompt to get stable diffusion to spit out the right blend of artists' labor are anywhere near equivalent to the literal collective millions of man hours spent by artists honing their skill in order to produce the content that AI companies took without consent or attribution and ran through a woodchipper. Oh, and corporations trying to use AI to replace artists, writers, call center employees, tech support agents...

Frankly, I'm absolutely flabbergasted that the popular sentiment on Lemmy seems to be so heavily in favor of defending large corporations taking data produced en masse by individuals without even so much as the most cursory of attribution (to say nothing of consent or compensation) and using it for the companies' personal profit. It's no different morally or ethically than Meta hoovering all of our personal data and reselling it to advertisers.

You're shifting the goalposts. You wanted an AI that can learn stuff while it's being used, and now you're unhappy that one existed that did so in a primitive form. If you want a general artificial intelligence that is also able to understand the words it says, we are still decades off. For now it can only work off patterns, for which the training data needs to be curated. And as explained previously, it's not infringing on copyright to train things on publicized works. You are simply denying that fact because you don't want that to be true, but it is. And that's why your sentiment isn't shared outside of some anti-AI circle you're part of.

The biggest users of AI are techbros who think that spending half an hour crafting a prompt to get stable diffusion to spit out the right blend of artists’ labor are anywhere near equivalent to the literal collective millions of man hours spent by artists honing their skill in order to produce the content that AI companies took without consent or attribution and ran through a woodchipper. Oh, and corporations trying to use AI to replace artists, writers, call center employees, tech support agents…

So because you don't know any creative people who use the technology ethically, they don't exist? Good to hear you're sticking up for the little guy who isn't making headlines or being provocative. I don't necessarily see these as ethical uses either, but it would be incredibly disingenuous to insinuate these are the only and primary ways to use AI. They are not, and your ignorance is showing if you actually believe so.

Frankly, I’m absolutely flabbergasted that the popular sentiment on Lemmy seems to be so heavily in favor of defending large corporations taking data produced en masse by individuals without even so much as the most cursory of attribution (to say nothing of consent or compensation) and using it for the companies’ personal profit. It’s no different morally or ethically than Meta hoovering all of our personal data and reselling it to advertisers.

I'm sorry, but you realize that this doesn't make any sense right? Large corporations are the ones who would have enough information and/or money at their disposal to train their own AIs without relying on publicized works. Should any kind of blockade be created to stop people training AI models from using public work, you would effectively be taking AI away from the masses in the form of Open Source models, not from those corporations. So if anything, it's you who is arguing for large corporations to have a monopoly on AI technology as it currently is.

Don't think I actually like companies like OpenAI or Meta. It's why I've been arguing about AI models in general, not their specific usage of the technology (as that is a whole different can of worms).

I'm not shifting the goal post--I have been consistent in my position that AI does not truly "learn" in the way that humans do, and is incapable of the comprehension required for actual human creativity. Tay spouting racist rhetoric because that's what was put into it supports that position, if anything; if it were capable of comprehending the language it was being fed, it wouldn't have done that.

You have stated that it's not infringing on copyright to train a model on published works, yes. I wholeheartedly disagree, because, as I have previously stated, AI models as they currently exist cannot produce new, derivative works based off their training data, but only reconstitute that training data in various combinations. This is important because one of the requirements for copyright protection, as per the US Copyright Office, is that the work is an independent creation, which "means that the author created the work without copying from other works." AI's inability to create its own work without copying from other works means that it cannot produce copyrightable material.

As a result, if you input an infringing dataset into an AI's training model, the resulting output is also infringing, because it is not, and cannot be, transformative to the level required to meet the minimal creativity threshold needed for copyright protection. At best, you can make an argument that the infringement in an AI's output is acceptable under the de minimis doctrine (i.e. that the amount of the copyrighted work contained in an infringing work is so trivial as to not warrant protection). However, my belief is that if a hypothetical composite work takes all of its source material from 100 different copyrighted sources, it wouldn't qualify for de minimis protection because the composite work is 100% infringing, even though each individual source only contributed 1% to the total work.

To summarize, my line of thinking is as follows:

The specific output of an AI does not in and of itself qualify for copyright protection because no human minds were involved in creating it, except for the mind that gave the AI the prompt; however, this involvement is not significant enough to meet the minimal creativity standard required for copyright protection. This is the position of the US Copyright Office (page 7, The Human Authorship Requirement):

The U.S. Copyright Office will register an original work of authorship, provided that the work was created by a human being. The copyright law only protects “the fruits of intellectual labor” that “are founded in the creative powers of the mind.” Trade-Mark Cases, 100 U.S. 82, 94 (1879). Because copyright law is limited to “original intellectual conceptions of the author,” the Office will refuse to register a claim if it determines that a human being did not create the work.

Since the specific output of an AI model lacks any copyright protection, that output does not qualify for any related defenses such as fair use, because these defenses require significant transformative effort in the work in question. If something cannot be transformative, novel, or new enough to qualify for copyright protection in the first place, it's impossible for it to be transformative enough for a fair use defense. It also cannot qualify for copyright protection as a compilation or derivative work, as both must contain copyrightable subject matter--since the AI output is not copyrightable, it cannot be claimed as either a compilation or a derivative.

As a result, if the training dataset input to an AI model is infringing, then the output of that AI model is also infringing, since the output does not independently qualify for copyright protection, nor can it leverage related defenses.

I’m sorry, but you realize that this doesn’t make any sense right? Large corporations are the ones who would have enough information and/or money at their disposal to train their own AIs without relying on publicized works. Should any kind of blockade be created to stop people training AI models from using public work, you would effectively be taking AI away from the masses in the form of Open Source models, not from those corporations. So if anything, it’s you who is arguing for large corporations to have a monopoly on AI technology as it currently is.

Large corporations and open-source AI models are scraping our IP without consent because they think they can get away with it, and because it's easier to steal it than to properly obtain consent from the people whose content they are using. And to be clear, I don't give a shit if preventing AI from stealing copyrighted content kills large open-source AI tools. If the only way they can be useful is by committing mass infringement, then they don't deserve to exist. They can use their own internally-developed datasets, use datasets that only draw from the public domain, obtain consent (which may or may not include royalties) from creators, or wither on the vine. That applies to both open-source and commercial AI technology.

Finally, I want to make it 100% clear that I have no issues with AI models that do not use copyrighted material in their training datasets. My employer introduced an AI chatbot trained entirely on our internal and public knowledgebases, and I'm perfectly fine with that morally/ethically/legally. (Personally, I think it's a little useless since the last time I used it the damn thing confidently gave me a false answer with fake links to nonexistent KB articles, but that's beside the point.) My entire issue with AI is centered around the unlicensed use of copyrighted material by AI models without the creators' consent, attribution, or compensation.