Report: Potential NYT lawsuit could force OpenAI to wipe ChatGPT and start over

wanderingmagus@lemm.ee to Technology@lemmy.world – 483 points –
arstechnica.com

It’s not like AI is using works to create something new. ChatGPT is similar to someone buying 10 different books, putting them into 1 book as a collection of stories, then mass producing and selling the “new” book. It’s the same thing but much more convoluted.

Edit: to reply to your main point, people who make things should absolutely be able to impose limitations on how they are used. That’s what copyright is. Someone else made a song, can you freely use that song in your movie since you listened to it once? Not without their permission. You wrote a book, can I buy a copy and then use it to make more copies and sell? Not without your permission.

it’s not even close to that black and white… i’d say it’s a much more grey area:

possibly that you buy a bunch of books by the same author and emulate their style… that’s perfectly acceptable until you start using their characters

if you wrote a research paper about the linguistic and statistical information that makes up an author’s style, that also wouldn’t be a problem

so there’s something beyond just the author’s “style” that they think is being infringed. we need to sort out exactly where the line is. what’s the extension to these 2 ideas that makes training an LLM a problem?

No, someone emulating someone else’s style is still going to have their own experiences, style, and creativity make their way into the book. They have an entire lifetime of “training data” to draw from. An AI that would “emulate” someone else’s style would really only be able to refer to the author’s books, or someone else’s books, therefore it’s stealing. Another example: if someone decided to remix different parts of a musician’s catalogue into one song, that would be a copyright infringement. AI adds nothing beyond what it’s trained on, therefore whatever it spits out is just other people’s works in a different way.

we output nothing other than what we’re trained on; the only difference is that we’re allowed to roam the world freely and consume whatever information we stumble on

what you say would be true if the LLM were only trained on content by the author seeking to say that their works had been infringed, however these LLMs include a lot of other data from public domain sources

one could consider these public domain sources and our experience of the world to be synonymous (and if you don’t i’d love to hear the distinction), in which case there’s some kind of a line that you seem to be drawing, and again i’d love to hear where you think that line is

is it just ratio? there’s precedent for that for sure: current law has fair use rules which consider things like “amount and substantiality”. in that case the question becomes one of defining the ratio. certainly the ratio of the author’s content to all the training content not written by that author is minuscule

I agree with what you’re saying, and a model that is only trained on public domain would be fine. I think the very obvious line is that it’s a computer program. There seems to be a desire for computers to be human, but they aren’t. They don’t consume media for their own enjoyment, they are forced to do it so someone can sell the output as a product. You can’t compare the public domain to life.

i think the distinction that either side is seeing here is that you think humans are inherently different to a neural network, where i think that the only difference is in the complexity: if we had a neural network at the same scale as the human brain, there’s nothing stopping those electronic neurons from connecting and responding in a way that’s indistinguishable from a human

the fact that we’re not there yet i don’t see as particularly relevant, because we’re talking about concepts rather than specifics… of course an LLM doesn’t display the same characteristics as a human: it’s not of the same scale, and the training is different, but functionally there’s nothing different between chemical neurons firing and neurons made of transistors firing

we learn in the same way: by reinforcing connections between our neurons

A few points:

  • Humans are more than just a brain. There’s the entire experience of ego, individualism, and body.

  • Another massive distinction is autonomy and liberty, which no AI models currently possess.

  • We don’t know all there is to know about the human brain. We can’t say it is functionally equivalent to a neural network.

  • If complexity is irrelevant, then the simplest neural network trained on a single work of writing is equivalent to the most advanced models for the purposes of this discussion. Such a network would, again, output a copy of the work it was trained on.

When we’ve developed a true self-aware AI that can move and think freely, the idea that there is little difference will have more weight to it.

Except it's not a collection of stories, it's an amalgamation, and at a very granular level at that. For instance, take the beginning of a sentence from the middle of the first book, then switch to a sentence in the 3rd, then finish with another part of the original sentence. Change some words here and there, add one for good measure (based on some sentence in the 7th book). Then fix the grammar. All the while, keep track that there's some continuity between the sentences you're stringing together.

That counts as "new" for me. And a lot of stuff humans do isn't more original.

Maybe the bigger argument against free-rein training is that you're attributing personal rights to a language model. Also, even people aren't completely free (legally) to derive things from memory, which is why clean-room design is a thing.

ChatGPT is similar to if someone were to buy 10 copies of different books, put them into 1 book as a collection of stories, then mass produce and sell the “new” book

That is not even close to correct. LLMs are little more than massively complex webs of statistics. Here’s a basic primer:

https://arstechnica.com/science/2023/07/a-jargon-free-explanation-of-how-ai-large-language-models-work/

I’ve coded LLMs, I was just simplifying it because at its base level it’s not that different. It’s just much more convoluted as I said. They’re essentially selling someone else’s work as their own, it’s just run through a computer program first.

it’s nothing like that at all… if someone bought a book and produced a big table of words and, for each word, the likelihood that it would be followed by each other word, that’s what we’re talking about: it’s abstract statistics

actually, that’s not even what we’re talking about… we then take that word table and combine it with hundreds of thousands of other tables until the result is so far from any one source as to be completely untraceable back to the original work
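
To make the "table of words" idea above concrete, here is a toy sketch in Python. It is an assumption-laden simplification: it only counts which word follows which (a bigram table), which is far cruder than what an actual LLM does, and the function names and the two one-line "books" are made up for illustration.

```python
from collections import Counter, defaultdict


def next_word_table(text):
    """Count, for each word in the text, how often each other word follows it."""
    table = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        table[current][following] += 1
    return table


def merge_tables(tables):
    """Add the counts of several per-source tables into one combined table."""
    merged = defaultdict(Counter)
    for table in tables:
        for word, followers in table.items():
            merged[word].update(followers)
    return merged


def likelihoods(table, word):
    """Turn the raw follower counts for one word into probabilities."""
    followers = table.get(word, Counter())
    total = sum(followers.values())
    return {nxt: count / total for nxt, count in followers.items()}


# Two hypothetical one-sentence "books", purely for illustration.
book_a = "the cat sat on the mat"
book_b = "the dog sat on the rug"

merged = merge_tables([next_word_table(book_a), next_word_table(book_b)])

print(likelihoods(merged, "the"))  # four followers, each with probability 0.25
print(likelihoods(merged, "sat"))  # {'on': 1.0}
```

In the merged table, the entry for "the" no longer records which source contributed which follower. Whether that loss of traceability carries over to a model trained on real books is exactly the point being argued in this thread; the sketch only illustrates the mechanism being described, not the legal question.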

If it were trained on a single book, the output would be the book. That’s the base level without all the convolution and that’s what we should be looking at. Do you also think someone should be able to train a model on your appearance and use it to sell images and videos, even though it’s technically not your likeness?