Sarah Silverman and other authors are suing OpenAI and Meta for copyright infringement, alleging that they're training their LLMs on books via Library Genesis and Z-Library

Technology@beehaw.org – 217 points – 12 months ago

Sarah Silverman Sues ChatGPT Creator for Copyright Infringement


People keep taking issue with this article's use of "summarizing" and linking to Wikipedia... Summaries of copyrighted work are obviously not illegal.

This article is oversimplified and does a crummy job of explaining the problem. Ars Technica does a much better job explaining.

The fact that the AI can summarize these works in detail is proof that it was trained on copyrighted material without permission (which is not fair use). Sarah Silverman is obviously not going to be hurt financially by this, but there are hundreds of thousands of authors who definitely will be affected. They have every right to sue.

Why does "fair use" even fall into it? I'm not familiar with their specific license, but the general definition of copyright is:

A copyright is a type of intellectual property that gives its owner the exclusive right to copy, distribute, adapt, display, and perform a creative work, usually for a limited time.

Nothing was copied, or distributed (in a form that anybody can consider "The Work"), or displayed, or performed. The only possible legal argument they have is adapting as a derivative work. And anybody who is familiar with how an LLM works knows that the form that results from reading in content is completely different from the source.

LLMs/LDMs are not taking in billions of books and putting them into a database. It is a very lossy process. Out of all of the billions of images in the Stable Diffusion training set, the resulting model is 4 GB. There is no universe where you can store billions of images in a mere 4 GB. Stable Diffusion cannot and will not, pixel-by-pixel, reproduce a Van Gogh. It can make something that kind of looks like a Van Gogh, but styles are not copyrightable.
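To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a hypothetical 2 billion training images (the comment only says "billions") and treats 4 GB as 4 × 10^9 bytes. Both constants and the function name are illustrative assumptions, not figures from the lawsuit or the article.

```python
def bytes_per_image(model_size_bytes: int, num_images: int) -> float:
    """Average number of model bytes available per training image."""
    return model_size_bytes / num_images


# Assumption: "4 GB" taken as 4 * 10^9 bytes.
MODEL_SIZE_BYTES = 4 * 10**9
# Assumption: 2 billion images, a stand-in for "billions".
ASSUMED_NUM_IMAGES = 2 * 10**9

budget = bytes_per_image(MODEL_SIZE_BYTES, ASSUMED_NUM_IMAGES)
print(f"{budget:.1f} bytes of model per training image")
```

Under those assumptions the model has a budget of about 2 bytes per training image, which is far too little to hold a copy of each one.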

The same applies to an LLM like ChatGPT. It cannot reproduce entire books, or anywhere close to that. If you ask it to recreate Page 25 of Silverman's book, it can't do it. If it doesn't even contain a minor portion of the original material, it can't even be considered a derivative work.

They don't have a case. They have a lot of publicity and noise, but they will lose to inevitability.

You make a lot of excellent points, but I think the main issue of contention is just using copyrighted work to train generative AI without the author's permission regardless.

If they did ask permission, there would be no problem. But an author or artist should be given the choice if their work is going to be used to train an AI.

You make a lot of excellent points, but I think the main issue of contention is just using copyrighted work to train generative AI without the author’s permission regardless.

If I read a book at the library... and come up with an amazing revolutionary product. Then make a company and go on to make billions of dollars per year. The original book's author has no claim to my income.

There's no contention. This is just a money grab. Copyright doesn't disallow people from consuming the content as they please. It simply forbids passing off the original work as your own when it isn't.

Well yeah, art is made to be consumed by people.
And all art is inspired by other art. People write sci-fi books after reading other sci-fi books, etc. That's not the issue here.

The issue is artists should be able to opt out of having their work taken and fed into a big project they have no control over.

Hard disagree. If my "company" from the previous post is a company that simply cribnotes and reviews books... You can't stop me from doing that either. Don't see people chomping the bit to take down other sites that have been doing this for decades.

Don’t see people chomping the bit to take down other sites that have been doing this for decades.

But this hasn't been happening for decades. Machine learning algorithms are an incredibly new way of processing data. All those scenarios you are talking about required a human to be the one doing the reading and summarising, which for most authors is fine, they expect people to read their work and summarise it, or quote it.

What they don't expect is for that work to be fed in full into a private company's data set to train a machine how to duplicate their content at speeds completely incomparable to human capabilities. We're talking about something completely new, completely unseen and you're disregarding the rights of those creators to not want their art, music or writing to be fed into the endless churn of data for these megacorporations.

Also, it's champing at the bit, not chomping.

Thank you for clarifying as I also had trouble recognizing the distinction at first.

and you’re disregarding the rights of those creators to not want their art, music or writing to be fed into the endless churn of data for these megacorporations.

... I don't see the authors having any rights at all once the work is published and sold. That's the point of SELLING the book. It's letting people do with it what they please. That's called "ownership". If I want to buy every copy of your book that I can get my hands on in a store and set it on fire... You have no say in it, no matter what. I purchased the book. That's it. If I'm literally a Nazi reading the Diary of Anne Frank, nobody gets to tell me that I'm not allowed to check the book out of the library. Your "rights" to the copyright of the book are irrelevant to my rights of ownership of the book. Or the library's right to loan the book out to whomever.

Also, it’s champing at the bit, not chomping.

Really don't care about grammar nazi-ing... and tell that to my phone's autocomplete.

The issue is artists should be able to opt out of having their work taken and fed into a big project they have no control over.

So in your opinion, should a university have to ask each author's permission, one by one, before using their work as a reference for each study run there?

There is already a well established practise of getting permission in academic settings for reprinting written work/journal articles/etc. etc. And all published authors and academics understand that their work will be read, maybe used in an academic setting, summarized, debated, discussed, quoted, etc. Getting permission is definitely a thing in academia.

Sure, permission needs to be sought for reprinting. That's not what we are discussing though. I will just take your word on that second part because as far as I know none of my professors asked author permission before telling the class to read anything.

Ah, I misunderstood. Sorry, I thought you meant for reprinting since that involved copyright. (Like providing readers or ... whatever those bound copied texts are that profs hand out. Do they still do that? haha) But you're right, if it's just reading or discussing you don't need permission for that.

Also I want to clarify that I'm not against AI or machine learning generally, I just think creators need to be asked first before their work is given to any LLM, since it's the owners of the LLMs who profit from their use, not the creators whose work helped grow it. That's all I'm saying. I know AI isn't going anywhere, and it can be used for a lot of great things. I just wish there were more protections for creators and their work, and that its growth was a little more ethical.

Such a well thought out and reasonable response. Thank you!

I generally do not agree with the law around intellectual property or even private property, so we likely would not see eye to eye, but I appreciate the chance to understand where you're coming from and what you believe.

I think the main issue of contention is just using copyrighted work to train generative AI without the author’s permission regardless.

You must define that in legal terms. This is a lawsuit, after all. It's not illegal to "just use" copyrighted work. The words "generative AI" are not in a federal or state bill anywhere in the US.

They can have an "issue of contention" all they want, but if they can't prove anything legally, they have nothing.

Exactly! You can't just be like "AI bad" in front of the judge ._.