The Irony of 'You Wouldn't Download a Car' Making a Comeback in AI Debates

FatCat@lemmy.world to Technology@lemmy.world – 47 points –

Those claiming AI training on copyrighted works is "theft" misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves. When AI systems ingest copyrighted works, they're extracting general patterns and concepts - the "Bob Dylan-ness" or "Hemingway-ness" - not copying specific text or images.

This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in "vector space". When generating new content, the AI isn't recreating copyrighted works, but producing new expressions inspired by the concepts it's learned.

This is fundamentally different from copying a book or song. It's more like the long-standing artistic tradition of being influenced by others' work. The law has always recognized that ideas themselves can't be owned - only particular expressions of them.

Moreover, there's precedent for this kind of use being considered "transformative" and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.

While it's understandable that creators feel uneasy about this new technology, labeling it "theft" is both legally and technically inaccurate. We may need new ways to support and compensate creators in the AI age, but that doesn't make the current use of copyrighted works for AI training illegal or unethical.

For those interested, this argument is nicely laid out by Damien Riehl in FLOSS Weekly episode 744. https://twit.tv/shows/floss-weekly/episodes/744


If ChatGPT was free I might see their point but it's not so no. If you're making money from someone's work you should pay them.

If they can base their business on stealing, then we can steal their AI services, right?

Pirating isn’t stealing but yes the collective works of humanity should belong to humanity, not some slimy cabal of venture capitalists.

Also, ingredients to a recipe aren't covered under copyright law.

ingredients to a recipe may well be subject to copyright, which is why food writers make sure their recipes are "unique" in some small way. Enough to make them different enough to avoid accusations of direct plagiarism.

E: removed unnecessary snark

I think there is some confusion here between copyright and patent, similar in concept but legally exclusive. A person can copyright the order and selection of words used to express a recipe, but the recipe itself is not copyrightable; it can, however, fall under patent law if proven to be unique enough, which is difficult to prove.

So you can technically own the patent to a recipe keeping other companies from selling the product of a recipe, however anyone can make the recipe themselves, if you can acquire it and not resell it. However that recipe can be expressed in many different ways, each having their own copyright.

In what country is that?

Under US law, you cannot copyright recipes. You can own a specific text in which you explain the recipe. But anyone can write down the same ingredients and instructions in a different way and own that text.

Keep in mind that "ingredients to a recipe" here refers to the literal physical ingredients, based on the context of the OP (where a sandwich shop owner can't afford to pay for their cheese).

While you can't copyright a recipe, you can patent the ingredients themselves, especially if you had a hand in doing R&D to create it. See PepsiCo sues four Indian farmers for using its patented Lay's potatoes.

No, you cannot patent an ingredient. What you can do - under Indian law - is get "protection" for a plant variety. In this case, a potato.

That law is called Protection of Plant Varieties and Farmers' Rights Act, 2001. The farmer in this case being PepsiCo, which is how they successfully sued these 4 Indian farmers.

Farmers' Rights for PepsiCo against farmers. Does that seem odd?

I've never met an intellectual property freak who didn't lie through his teeth.

Yes, that's exactly the point. It should belong to humanity, which means that anyone can use it to improve themselves. Or to create something nice for themselves or others. That's exactly what AI companies are doing. And because it is not stealing, it is all still there for anyone else. Unless, of course, the copyrightists get their way.

How do you feel about Meta and Microsoft who do the same thing but publish their models open source for anyone to use?

Well how long do you think that's going to last? They are for-profit companies after all.

Those aren't open source, neither by the OSI's Open Source Definition nor by the OSI's Open Source AI Definition.

The important part for the latter being a published listing of all the training data. (Trainers don't have to provide the data, but they must provide at least a way to recreate the model given the same inputs).

Data information: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data. Data information shall be made available with licenses that comply with the Open Source Definition.

They are model-available if anything.

For the purposes of this conversation, that's pretty much just a pedantic difference. They are paying to train those models and then providing them to the public to use completely freely in any way they want.

It would be like developing open source software and then not calling it open source because you didn't publish the market research that guided your UX decisions.

You said open source. Open source is a type of licensure.

The entire point of licensure is legal pedantry.

And as far as your metaphor is concerned, pre-trained models are closer to pre-compiled binaries, which are expressly not considered Open Source according to the OSD.

You said open source. Open source is a type of licensure.

The entire point of licensure is legal pedantry.

No. Open source is a concept. That concept also has pedantic legal definitions, but the concept itself is not inherently pedantic.

And as far as your metaphor is concerned, pre-trained models are closer to pre-compiled binaries, which are expressly not considered Open Source according to the OSD.

No, they're not. Which is why I didn't use that metaphor.

A binary is explicitly a black box. There is nothing to learn from a binary, unless you explicitly decompile it back into source code.

In this case, literally all the source code is available. Any researcher can read through their model, learn from it, copy it, twist it, and build their own version of it wholesale. Not providing the training data is more similar to saying that Yuzu or an emulator isn't open source because it doesn't provide copyrighted games. It is providing literally all of the parts of it that it can open source, and then letting the user feed it whatever training data they are allowed access to.

Look... All I have to say is... Support the Internet Archive!

(please)

Heh. Funny that this comment is uncontroversial. The Internet Archive supports Fair Use because, of course, it does.

This is from a position paper explicitly endorsed by the IA:

Based on well-established precedent, the ingestion of copyrighted works to create large language models or other AI training databases generally is a fair use.

By

  • Library Copyright Alliance
  • American Library Association
  • Association of Research Libraries

The argument that these models learn in a way that's similar to how humans do is absolutely false, and the idea that they discard their training data and produce new content is demonstrably incorrect. These models can and do regurgitate their training data, including copyrighted characters.

And these things don't learn styles, techniques, or concepts. They effectively learn statistical averages and patterns and collage them together. I've gotten to the point where I can guess what model of image generator was used based on the same repeated mistakes that they make every time. Take a look at any generated image, and you won't be able to identify where a light source is because the shadows come from all different directions. These things don't understand the concept of a shadow or lighting, they just know that statistically lighter pixels are followed by darker pixels of the same hue and that some places have collections of lighter pixels. I recently heard about an AI that scientists had trained to identify pictures of wolves that was working with incredible accuracy. When they went in to figure out how it was telling wolves apart from dogs like huskies so well, they found that it wasn't even looking at the wolves at all. 100% of the images of wolves in its training data had snowy backgrounds, so it was simply searching for concentrations of white pixels (and therefore snow) in the image to determine whether or not a picture was of wolves.
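The wolf anecdote describes what is usually called a spurious correlation. A minimal sketch of the effect, using made-up data rather than the actual study: the feature names (`snowy_background`, `pointed_ears`) and the four training rows are hypothetical, and the "learner" is just a one-feature rule picker, not a neural network.

```python
def fit_stump(rows, features):
    """Pick the single binary feature that best matches the label on the training rows."""
    best = None
    for f in features:
        correct = sum(1 for r in rows if r[f] == r["is_wolf"])
        acc = correct / len(rows)
        if best is None or acc > best[1]:
            best = (f, acc)
    return best

# Hypothetical training set: every wolf photo happens to have snow behind it.
train = [
    {"snowy_background": 1, "pointed_ears": 1, "is_wolf": 1},
    {"snowy_background": 1, "pointed_ears": 1, "is_wolf": 1},
    {"snowy_background": 0, "pointed_ears": 1, "is_wolf": 0},  # husky on grass
    {"snowy_background": 0, "pointed_ears": 0, "is_wolf": 0},  # some other dog
]

feature, accuracy = fit_stump(train, ["pointed_ears", "snowy_background"])

# A husky photographed in snow: the learned rule only looks at the background.
husky_in_snow = {"snowy_background": 1, "pointed_ears": 1}
prediction = husky_in_snow[feature]  # 1 means "wolf"

print(feature, accuracy, prediction)
```

On this toy data the rule that scores best in training is the background, so the husky in snow gets labelled a wolf even though training accuracy looked perfect.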

Basing your argument around how the model or training system works doesn't seem like the best way to frame your point to me. It invites a lot of mucking about in the details of how the systems do or don't work, how humans learn, and what "learning" and "knowledge" actually are.

I'm a human as far as I know, and it's trivial for me to regurgitate my training data. I regularly say things that are either directly references to things I've heard, or accidentally copy them, sometimes with errors.
Would you argue that I'm just a statistical collage of the things I've experienced, seen or read? My brain has as many copies of my training data in it as the AI model, namely zero, but "Captain Picard of the USS Enterprise sat down for a rousing game of chess with his friend Sherlock Holmes, and then Shakespeare came in dressed like Mickey Mouse and said 'to be or not to be, that is the question, for tis nobler in the heart' or something". Direct copies of someone else's work, as well as multiple copyright infringements.
I'm also shit at drawing with perspective. It comes across like a drunk toddler trying their hand at cubism.

Arguing about how the model works or the deficiencies of it to justify treating it differently just invites fixing those issues and repeating the same conversation later. What if we make one that does work how humans do in your opinion? Or it properly actually extracts the information in a way that isn't just statistically inferred patterns, whatever the distinction there is? Does that suddenly make it different?

You don't need to get bogged down in the muck of the technical to say that even if you concede every technical point, we can still say that a non-sentient machine learning system can be held to different standards with regards to copyright law than a sentient person. A person gets to buy a book, read it, and then carry around that information in their head and use it however they want. Not-A-Person does not get to read a book and hold that information without consent of the author.
Arguing why it's bad for society for machines to mechanise the production of works inspired by others is more to the point.

Computers think the same way boats swim. Arguing about the difference between hands and propellers misses the point that you don't want a shrimp boat in your swimming pool. I don't care why they're different, or that it technically did or didn't violate the "free swim" policy, I care that it ruins the whole thing for the people it exists for in the first place.

I think all the AI stuff is cool, fun and interesting. I also think that letting it train on everything regardless of the creators' wishes has too much opportunity to make everything garbage. Same for letting it produce content that isn't labeled or cited.
If they can find a way to do and use the cool stuff without making things worse, they should focus on that.

Arguing why it's bad for society for machines to mechanise the production of works inspired by others is more to the point.

I agree, but the fact that shills for this technology are also wrong about it is at least interesting.

Rhetorically speaking, I don't know if that's useless.

I don't care why they're different, or that it technically did or didn't violate the "free swim" policy,

I do like this point a lot.

If they can find a way to do and use the cool stuff without making things worse, they should focus on that.

I do miss when the likes of cleverbot was just a fun novelty on the Internet.

Even if they learned exactly like humans do, like so fucking what, right!? Humans have to pay EXORBITANT fees for higher education in this country. Arguing that your bot gets socialized education before the people do is fucking absurd.

That seems more like an argument for free higher education rather than restricting what corpuses a deep learning model can train on

Why not both? Allowing major corps to put even more downward pressure on workers doesn't help anyone but the rich. LLMs aren't going to save the world or become sentient.

I am also not really getting the argument. If I as a human want to learn a subject from a book I buy it ( or I go to a library who paid for it). If it’s similar to how humans learn, it should cost equally much.

The issue is of course that it’s not at all similar to how humans learn. It needs VASTLY more data to produce something even remotely sensible. Develop AI that’s truly transformative, by making it as efficient as humans are in learning, and the cost of paying for copyright will be negligible.

If I as a human want to learn a subject from a book, I buy it

xD
That's good.

Dude never heard of a library. I only bought a handful of books during my degree, I would've been homeless if I had to buy a copy of every learning source

That was literally in my post. Obviously, in that case the library pays for copyright

If I as a human want to learn a subject from a book I buy it ( or I go to a library who paid for it). If it’s similar to how humans learn, it should cost equally much.

You're on Lemmy where people casually say "piracy is morally the right thing to do", so I'm not sure this argument works on this platform.

I know my way around the Jolly Roger myself. At the same time using copyrighted materials in a commercial setting (as OpenAI does) shouldn’t be free.

Only if they are selling the output. I see it as more they are selling access to the service on a server farm, since running ChatGPT is not cheap.

The usual cycle of tech-bro capitalism would put them currently at the early stage of acquiring market saturation. So it's unlikely that they are currently charging what they will when they are established and have displaced lots of necessary occupations.

Imagine if you had blinders and earmuffs on for most of the day, and only once in a while were you allowed to interact with certain people and things. Your ability to communicate would be truncated to only what you were allowed to absorb.

Devil's Advocate:

How do we know that our brains don't work the same way?

Why would it matter that we learn differently than a program learns?

Suppose someone has a photographic memory, should it be illegal for them to consume copyrighted works?

Because we're talking pattern recognition levels of learning. At best, they're the equivalent of parrots mimicking human speech. They take inputs and output data based on the statistical averages from their training sets - collaging pieces of their training into what they think is the right answer. And I use the word think here loosely, as this is the exact same process that the Gaussian blur tool in Photoshop uses.

This matters in the context of the fact that these companies are trying to profit off of the output of these programs. If somebody with an eidetic memory is trying to sell pieces of works that they've consumed as their own - or even somebody copy-pasting bits from CliffsNotes - then they should get in trouble; the same as these companies.

Given A and B, we can understand C. But an LLM will only be able to give you AB, A(b), and B(a). And they've even been just spitting out A and B wholesale, proving that they retain their training data and will regurgitate the entirety of copyrighted material.

Here's an experiment for you to try at home. Ask an AI model a question, copy a sentence or two of what they give back, and paste it into a search engine. The results may surprise you.

And stop comparing AI to humans but then giving AI models more freedom. If I wrote a paper I'd need to cite my sources. Where the fuck are your sources ChatGPT? Oh right, we're not allowed to see that but you can take whatever you want from us. Sounds fair.

Not to fully argue against your point, but I do want to push back on the citations bit. Given the way an LLM is trained, it's not really close to equivalent to me citing papers researched for a paper. That would be more akin to asking me to cite every piece of written or verbal media I've ever encountered as they all contributed in some small way to the way that the words were formulated here.

Now, if specific data were injected into the prompt, or maybe if it was fine-tuned on a small subset of highly specific data, I would agree those should be cited as they are being accessed more verbatim. The whole "magic" of LLMs was that it needed to cross a threshold of data, combined with the attentional mechanism, and then the network was pretty suddenly able to maintain coherent sentence structure. It was only with loads of varied data from many different sources that this really emerged.

It's not a breach of copyright or other IP law not to cite sources on your paper.

Getting your paper rejected for lacking sources is also not infringing on your freedom. Being forced to pay damages and delete your paper from any public space would be infringement of your freedom.

I’m pretty sure that it’s true that citing sources isn’t really relevant to copyright violation, either you are violating or not. Saying where you copied from doesn’t change anything, but if you are using some ideas with your own analysis and words it isn’t a violation either way.

With music this often ends up in civil court. Pretty sure the same can in theory happen for written texts, but the commercial value of most written texts is not worth the cost of litigation.

I mean, you're not necessarily wrong. But that doesn't change the fact that it's still stealing, which was my point. Just because laws haven't caught up to it yet doesn't make it any less of a shitty thing to do.

It's not stealing, it's not even 'piracy' which also is not stealing.

Copyright laws need to be scaled back, to not criminalize socially accepted behavior, not expand.

The original source material is still there. They just made a copy of it. If you think that's stealing then online piracy is stealing as well.

Well they make a profit off of it, so yes. I have nothing against piracy, but if you're reselling it that's a different story.

But piracy saves you money which is effectively the same as making a profit. Also, it's not just that they're selling other people's work for profit. You're also paying for the insane amount of computing power it takes to train and run the AI plus salaries of the workers etc.

When I analyze a melody I play on a piano, I see that it reflects the music I heard that day or sometimes, even music I heard and liked years ago.

Having parts similar or a part that is (coincidentally) identical to a part from another song is not stealing and does not infringe upon any law.

You guys are missing a fundamental point. Copyright was created to protect an author for a specific amount of time so somebody else doesn't profit from their work, essentially stealing their deserved revenue.

LLM AI was created to do exactly that.

This is the catch with OP's entire statement about transformation. Their premise is flawed, because the next most likely token is usually the same word the author of a work chose.
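The "next most likely token" point can be illustrated with a toy model. This is only a sketch under heavy assumptions: a word-level bigram counter stands in for an LLM, the training text is a single public-domain sentence, and the function names (`train_bigrams`, `greedy_continue`) are made up for the example. Real models are neural networks trained on vastly more text, so whether and how often they behave this way is exactly what the thread is arguing about.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def greedy_continue(counts, start, max_words=20):
    """Repeatedly append the most frequent next word (greedy decoding)."""
    out = [start]
    while len(out) < max_words and out[-1] in counts:
        nxt = counts[out[-1]].most_common(1)[0][0]
        out.append(nxt)
    return " ".join(out)

# Public-domain training text (opening of Moby-Dick, punctuation removed).
training_text = "call me Ishmael some years ago never mind how long precisely"

model = train_bigrams(training_text)
generated = greedy_continue(model, "call")
print(generated)
print(generated == training_text)
```

When a sequence is the only evidence the model has for what follows each word, picking the most likely next word at every step walks straight back through the original sentence.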

And that's kinda my point. I understand that transformation is totally fine but these LLMs literally copy and paste shit. And that's still if you are comparing AI to people which I think is completely ridiculous. If anything these things are just more complicated search engines with half the usefulness. If I search online about how to change a tire I can find some reliable sources to do so. If I ask AI how to change a tire it would just spit something out that might not even be accurate and I'd have to search again afterwards just to make sure what it told me was even accurate.

It's just a word calculator based on information stolen from people without their consent. It has no original thought process so it has no way to transform anything. All it can do is copy and paste in different combinations.

Generative AI does not work like this. They're not like humans at all, it will regurgitate whatever input it receives, like how Google can't stop Gemini from telling people to put glue on their pizza. If it really worked like that, there wouldn't be these broad and extensive policies within tech companies about using it with company sensitive data like protection compliances. The day that a health insurance company manager says, "sure, you can feed ChatGPT medical data" is the day I trust genAI.

This process is akin to how humans learn...

I'm so fucking sick of people saying that. We have no fucking clue how humans LEARN. Aka gather understanding aka how cognition works or what it truly is. On the contrary we can deduce that it probably isn't very close to human memory/learning/cognition/sentience (any other buzzwords that are stand-ins for things we don't understand yet), considering human memory is extremely lossy and tends to infer its own bias, as opposed to LLMs that do neither and religiously follow patterns to their own fault.

It's quite literally a text prediction machine that started its life as a translator (and still does amazingly at that task), it just happens to turn out that general human language is a very powerful tool all on its own.

I could go on and on as I usually do on lemmy about AI, but your argument is literally "Neural network is theoretically like the nervous system, therefore human", I have no faith in getting through to you people.

Even worse is, in order to further humanize machine learning systems, they often give them human-like names.

Not even stealing cheese to run a sandwich shop.

Stealing cheese to melt it all together and run a cheese shop that undercuts the original cheese shops they stole from.

"but how are we supposed to keep making billions of dollars without unscrupulous intellectual property theft?! line must keep going up!!"

This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages.

Machine learning algorithms are not people and are not ingesting these works the same way a person does. This argument is brought up all the time and just doesn't ring true. You're defending the unethical use of copyrighted works by a giant corporation with a metaphor that doesn't have any bearing on reality; in an age where artists are already shamefully undervalued. Creating art is a human process with the express intent of it being enjoyed by other humans. Having an algorithm do it is removing the most important part of art; the humanity.

The problem with your argument is that it is 100% possible to get ChatGPT to produce verbatim extracts of copyrighted works. This has been suppressed by OpenAI in a rather brute force kind of way, by prohibiting the prompts that have been found so far to do this (e.g. the infamous "poetry poetry poetry..." ad infinitum hack), but the possibility is still there, no matter how much they try to plaster over it. In fact there are some people, much smarter than me, who see technical similarities between compression technology and the process of training an LLM, calling it a "blurry JPEG of the Internet"... the point being, you wouldn't allow distribution of a copyrighted book just because you compressed it in a ZIP file first.

The problem with your argument is that it is 100% possible to get ChatGPT to produce verbatim extracts of copyrighted works.

Exactly! This is the core of the argument The New York Times made against OpenAI. And I think they are right.

I agree. You can't just dismiss the problem saying it's "just data represented in vector space" and on the other hand not be able to properly censor the models and require AI safety research. If you don't know exactly what's going on inside, you also can't claim that copyright is not being violated.

It honestly blows my mind that people look at a neural network that's even capable of recreating short works it was trained on without having access to that text during generation... and choose to focus on IP law.

Right! Like if we could honestly further enhance that feature it's an incredible increase in compression tech!

This would be a good point, if this is what the explicit purpose of the AI was. Which it isn't. It can quote certain information verbatim despite not containing that data verbatim, through the process of learning, for the same reason we can.

I can ask you to quote famous lines from books all day as well. That doesn't mean that you knowing those lines means you infringed on copyright. Now, if you were to put those to paper and sell them, you might get a cease and desist or a lawsuit. Therein lies the difference. Your goal would be explicitly to infringe on the specific expression of those words. Any human that would explicitly try to get an AI to produce infringing material... would be infringing. And unknowing infringement... well there are countless court cases where both sides think they did nothing wrong.

You don't even need AI for that, if you followed the Infinite Monkey Theorem and just happened to stumble upon a work falling under copyright, you still could not sell it even if it was produced by a purely random process.

Another great example is the Mona Lisa. Most people know what it looks like and if they had sufficient talent could mimic it 1:1. However, there are numerous adaptations of the Mona Lisa that are not infringing (by today's standards), because they transform the work to the point where it's no longer the original expression, but a re-expression of the same idea. Anything less than that is pretty much completely safe infringement wise.

You're right though that OpenAI tries to cover their ass by implementing safeguards. Which is to be expected because it's a legal argument in court that once they became aware of situations they have to take steps to limit harm. They can indeed not prevent it completely, but it's the effort that counts. Practically none of that kind of moderation is 100% effective. Otherwise we'd live in a pretty good world.

Y'all should really stop expecting people to buy into the analogy between human learning and machine learning i.e. "humans do it, so it's okay if a computer does it too". First of all there are vast differences between how humans learn and how machines "learn", and second, it doesn't matter anyway because there is lots of legal/moral precedent for not assigning the same rights to machines that are normally assigned to humans (for example, no intellectual property right has been granted to any synthetic media yet that I'm aware of).

That said, I agree that "the model contains a copy of the training data" is not a very good critique--a much stronger one would be to simply note all of the works with a Creative Commons "No Derivatives" license in the training data, since it is hard to argue that the model checkpoint isn't derived from the training data.

Equating LLMs with compression doesn't make sense. Model sizes are larger than their training sets. If it requires "hacking" to extract text of sufficient length to break copyright, and the platform is doing everything they can to prevent it, that just makes them like every platform. I can download © material from YouTube (or wherever) all day long.

Model sizes are larger than their training sets

Excuse me, what? You think Huggingface is hosting 100's of checkpoints each of which are multiples of their training data, which is on the order of terabytes or petabytes in disk space? I don't know if I agree with the compression argument, myself, but for other reasons--your retort is objectively false.

The issue isn't that you can coax AI into giving away unaltered copyrighted books out of their trunk, the issue is that if you were to open the hood, you'd see that the entire engine is made of unaltered copyrighted books.

All those "anti hacking" measures are just there to obfuscate the fact that the unaltered works are in use and recallable at all times.

This is an inaccurate understanding of what's going on. Under the hood is a neural network with weights and biases, not a database of copyrighted work. That neural network was trained on a HEAVILY filtered training set (as mentioned above, 45 terabytes was reduced to 570 GB for GPT-3). Getting it to bug out and generate full sections of training data from its neural network is a fun parlor trick, but you're not going to use it to pirate a book. People do that the old fashioned way by just adding type:pdf to their common web search.
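For a sense of scale, the sizes mentioned in this exchange can be compared with some back-of-the-envelope arithmetic. The 45 TB and 570 GB figures come from the comment above; the 175 billion parameter count is GPT-3's publicly reported size and is not stated in this thread, and the bytes-per-parameter values are assumptions about storage precision.

```python
GB = 1e9  # decimal gigabyte, in bytes

def weights_size_gb(num_params, bytes_per_param):
    """Approximate on-disk size of a model's weights in GB."""
    return num_params * bytes_per_param / GB

RAW_CRAWL_GB = 45_000   # 45 TB before filtering (figure from the thread)
FILTERED_GB = 570       # after filtering (figure from the thread)
GPT3_PARAMS = 175e9     # assumption: publicly reported GPT-3 parameter count

fp16_gb = weights_size_gb(GPT3_PARAMS, 2)  # assumption: 2 bytes per parameter
fp32_gb = weights_size_gb(GPT3_PARAMS, 4)  # assumption: 4 bytes per parameter

for label, size in [("fp16", fp16_gb), ("fp32", fp32_gb)]:
    print(f"{label}: {size:.0f} GB, "
          f"{size / FILTERED_GB:.2f}x the filtered set, "
          f"{size / RAW_CRAWL_GB:.4f}x the raw crawl")
```

Under these assumptions the weights land in the same order of magnitude as the filtered text and are far smaller than the raw crawl, which is worth keeping in mind for both the "larger than the training set" claim and the compression analogy.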

Again: nobody is complaining that you can make AI spit out their training data because AI is the only source of that training data. That is not the issue and nobody cares about AI as a delivery source of pirated material. The issue is that next to the transformed output, the not-transformed input is in use in a commercial product.

ML techniques have been very useful in compression, yes, but it's sort of nuts to say that a data structure that encodes only relationships between values (sometimes overly so for certain regions of its latent space/embedding space/semantics space/whatever you want to call it right now), rather than the value sequences themselves, is storing particularized creative works in a particularly identifiable manner.

Except that, again, as is literally written in the comment you're directly replying to, it has been shown that AI can reproduce copyrightable works word for word, showing that it objectively and necessarily is storing particular creative works in a particularly identifiable manner, whether or not that manner is yet known to humans.

No, it isn't storing that information in that sequence. What is happening is that it is overly encoding those particular sequential relationships along some arbitrary but tightly mapped semantic concepts represented by dimensions in a massive vector space. It is storing copies of the information in the way that inadvertent copying of music might be based on "memorized" music listened to by the infringing artist in the past.

Not what I said. I used the exact language the above commenter used because it was specific and accurate. Also, inadvertent copyright violation is still copyright violation under US law. I'm not the biggest fan of every application of that law, but the ability to keep large corporations from ripping off small artists and creators is one that I think is good and useful under the global economic system we live under currently.

Yes, inadvertent copying is still copying, but it would be copying in the output and is not evidence of copying happening in the creation of the model. That was why I used the music example, because it is rather probative of where there could be grounds for copyright infringement related to these model architectures. This may not seem an important distinction, but it has significant consequences on who is ultimately liable and how.

It's called learning, and I wish people did more of it.

You don't learn by memorizing and reproducing works, you learn by understanding the concepts in various works and producing new works that are combinations of the ideas in those other works. AI doesn't understand, and it has been shown to be able to reproduce works, so I think it's fair to say that it's doing a lot of "memorizing" and therefore plagiarism.

Calling what attention transformers do memorization is wildly inaccurate.

*Unless we're talking about semantic memory.

Is it though? People memorize things very differently than computers do, but the actual mechanism of storage isn't particularly important. What's important is the net result. Whether it uses Bayesian networks (what we used in class for small-scale NLP), neural networks (what I assume LLMs use), or something else doesn't particularly matter.

For example, a search engine typically only stores keywords and relationships, so there's no way for it to reproduce an entire work (ignoring, of course, the "caching" features some search engines have). All it does is associate keywords with source material, so there's a strong argument that it falls under fair use.
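To make the search engine point concrete, here is a minimal sketch of a keyword index in Python. It is a toy, not how any real search engine is built; the function names (`build_index`, `search`) and the two sample documents are made up for illustration. The index only records which documents contain which words, so word order is gone and the original text cannot be rebuilt from it.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, word):
    """Return the IDs of documents containing the word (empty set if none)."""
    return set(index.get(word.lower(), set()))

# Hypothetical sample documents, purely for illustration.
docs = {
    "a": "the quick brown fox",
    "b": "the lazy dog",
}
index = build_index(docs)
```

A lookup such as `search(index, "fox")` points back to document `"a"`, but nothing in `index` says that "quick" came before "brown".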

LLMs, on the other hand, process entire works and keep more than just keywords, and they store it in such a way that entire works can be recovered if coaxed. My understanding is that they break up words into something like sets of phonemes, and then queries do a similar break-up as input to the neural network to produce an output, which is then reassembled into text. But that's my relatively naive understanding of how it all works (I've only done university level NLP, and that was years ago), but again, that's really not the point here. The point is that it uses a lot more of the work than the typical understanding of "fair use," and if copyrighted works can be reproduced by it, then the copyrighted work is "stored" in some fashion, so it can be thought of as a really complex form of compression, with tricky retrieval mechanisms. So in layman's terms, it's "memorizing" entire works in a way not entirely unlike a "mind palace", and to reproduce a given work, you need the right input to follow the right steps, but a slightly different input will lead to a very different output (i.e. maybe something with similar content, but no copyright violations).
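As a rough illustration of the "stored in some fashion" argument, here is a toy next-word lookup model in Python. To be clear about the assumptions: this is a simple n-gram table, not a neural network, and real LLMs work on subword tokens rather than whole words or phonemes. The names `train` and `generate` and the sample sentence are invented for the sketch. It only shows that a structure holding "what tends to follow what" can still give back the training text verbatim when prompted with the right opening, and gives back nothing when the prompt is slightly off.

```python
from collections import Counter, defaultdict

def train(tokens, order=2):
    """Count which token follows each run of `order` consecutive tokens."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])
        model[context][tokens[i + order]] += 1
    return model

def generate(model, prompt, order=2, max_tokens=50):
    """Extend the prompt by repeatedly picking the most frequent next token."""
    out = list(prompt)
    for _ in range(max_tokens):
        context = tuple(out[-order:])
        if context not in model:
            break  # unseen context: the toy model has nothing to say
        out.append(model[context].most_common(1)[0][0])
    return out

# Made-up training text in which every two-word context is unique.
text = "the cat sat on a mat near an open door".split()
model = train(text, order=2)

verbatim = generate(model, ["the", "cat"])   # right prompt: full text comes back
different = generate(model, ["a", "dog"])    # slightly different prompt: no continuation
```

Whether this kind of recoverability counts as a "copy" in the legal sense is exactly what the thread is arguing about; the sketch takes no position on that.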

What's at issue isn't whether the LLM is likely to reproduce entire works, but whether it can and does, which would mean it's violating fair use standards.

"This process is akin to how humans learn... The AI discards the original text, keeping only abstract representations..."

Now I sail the high seas myself, but I don't think Paramount Studios would buy anyone's defence they were only pirating their movies so they can learn the general content so they can produce their own knockoff.

Yes artists learn and inspire each other, but more often than not I'd imagine they consumed that art in an ethical way.

The whole point of copyright in the first place, is to encourage creative expression, so we can have human culture and shit.

The idea of a "teensy" exception so that we can "advance" into a dark age of creative pointlessness and regurgitated slop, where humans doing the fun part has been made "unnecessary" by the unstoppable progress of "thinking" machines, would be hilarious, if it weren't depressing as fuck.

The whole point of copyright in the first place, is to encourage creative expression, so we can have human culture and shit.

I feel like that purpose has already been undermined by various changes to copyright law since its inception, such as DMCA and lengthening copyright term from 14 years to 95. Freedom to remix existing works is an important part of creative expression which current law stifles for any original work that releases in one person's lifespan. (Even Disney knew this: the animated Pinocchio movie wouldn't exist if copyright could last more than 56 years then)

Either way, giving bots the 'right' to remix things that were just made less than a year ago while depriving humans the right to release anything too similar to a 94 year old work seems ridiculous on both ends.

The whole point of copyright in the first place, is to encourage creative expression

...within a capitalistic framework.

Humans are creative creatures and will express themselves regardless of economic incentives. We don't have to transmute ideas into capital just because they have "value".

Sorry buddy, but that capitalistic framework is where we all have to exist for the foreseeable future.

Giving corporations more power is not going to help us end that.

I'd agree, but here's one issue with that: we live in reality, not in a post-capitalist dreamworld.

Creativity takes up a lot of time from the individual, while a lot of us are already working two or even three jobs, all on top of art. A lot of us have to heavily compromise on a lot of things, or even give up our dreams because we don't have the time for that. Sure, you get the occasional "legendary metal guitarist practiced so much he even went to the toilet with a guitar", but many are so tired from their main job, they instead just give up.

Developing a game while having a full-time job feels like crunching 24/7, while only around 4 hours go towards that goal, and that includes work done on my smartphone at my job. Others just outright give up. This shouldn't be the norm for up-and-coming artists.

Honestly, that's why open source AI is such a good thing for small creatives. Hate it or love it, anyone wielding AI with the intention to make new expression will be in a much safer and more efficient position to succeed until they can grow big enough to hire a team of specialists. People often look at those at the top but ignore the things that can grow from the bottom and actually create more creative expression.

One issue is that many open source AI projects also try to ape whatever the big ones are doing at the moment, the most outrageous example being one that generates a timelapse for AI art.

There are also tools that were created especially with artists in mind, but they're less popular because the average person cannot use them as easily as the prompter machines, nor do they promise the end of "people with fake jobs" (boomers like generative AI for this reason).

You're not wrong.

The kind of art humanity creates is skewed a lot by the need for it to be marketable, and then sold in order to be worth doing.

But copyright is better than nothing, and this exemption would straight up be even worse than nothing.

Humans are indeed creative by nature, we like making things. What we don't naturally do is publish, broadcast and preserve our work.

Society is iterative. What we build today, we build mostly out of what those who came before us built. We tell our versions of our forefathers' stories, we build new and improved versions of our forefathers' machines.

A purely capitalistic society would have infinite copyright and patent durations, this idea is mine, it belongs to me, no one can ever have it, my family and only my family will profit from it forever. Nothing ever improves because improving on an old idea devalues the old idea, and the landed gentry can't allow that.

A purely communist society immediately enters whatever anyone creates into the public domain. The guy who revolutionizes energy production making everyone's lives better is paid the same as a janitor. So why go through all the effort? Just sweep the floors.

At least as designed, our idea of copyright is a compromise. If you have an idea, we will grant you a limited time to exclusively profit from your idea. You may allow others to also profit at your discretion; you can grant licenses, but that's up to you. After the time is up, your idea enters the public domain, and becomes the property and heritage of humanity, just like the Epic of Gilgamesh. Others are free to reproduce and iterate upon your ideas.

I think you have your janitor example backwards. Spending my time revolutionizing energy production sounds much more enjoyable than sweeping floors. Same with designing an effective floor sweeping robot.

That’s the reason we got copyright, but I don’t think that’s the only reason we could want copyright.

Two good reasons to want copyright:

  1. Accurate attribution
  2. Faithful reproduction

Accurate attribution:

Open source thrives on the notion that: if there’s a new problem to be solved, and it requires a new way of thinking to solve it, someone will start a project whose goal is not just to build new tools to solve the problem but also to attract other people who want to think about the problem together.

If anyone can take the codebase and pretend to be the original author, that will splinter the conversation and degrade the ability of everyone to find each other and collaborate.

In the past, this was pretty much impossible because you could check a search engine or social media to find the truth. But with enshittification and bots at every turn, that looks less and less guaranteed.

Faithful reproduction:

If I write a book and make some controversial claims, yet it still provokes a lot of interest, people might be inclined to publish slightly different versions to advance their own opinions.

Maybe a version where I seem to be making an abhorrent argument, in an effort to mitigate my influence. Maybe a version where I make an argument that the rogue publisher finds more palatable, to use my popularity to boost their own arguments.

This actually happened during the early days of publishing, by the way! It’s part of the reason we got copyright in the first place.

And again, it seems like this would be impossible to get away with now, buuut… I’m not so sure anymore.

Personally:

I favor piracy in the sense that I think everyone has a right to witness culture even if they can’t afford the price of admission.

And I favor remixing because the cultural conversation should be an active read-write two-way street, not just passive consumption.

But I also favor some form of licensing, because I think we have a duty to respect the integrity of the work and the voice of the creator.

I think AI training is very different from piracy. I’ve never downloaded a mega pack of songs and said to my friends “Listen to what I made!” I think anyone who compares OpenAI to pirates (favorably) is unwittingly helping the next set of feudal tech lords build a wall around the entirety of human creativity, and they won’t realize their mistake until the real toll booths open up.

I think AI training is very different from piracy. I’ve never downloaded a mega pack of songs and said to my friends “Listen to what I made!”

I've never done this. But I have taken lessons from people for instruments, listened to bands I like, and then created and played songs that certainly are influenced by all of that. I've also taken a lot of art classes, and studied other people's painting styles and then created things from what I've learned, and said "look at what I made!" Which is far more akin to what AI is doing than what you are implying here.

So what if it's closer? It's still not an accurate description, because that's not what AI does.

Because what they are describing is just straight up theft, while what I described is so much closer to how one trains an AI. I'm afraid that what comes out of this AI hysteria is that copyright gets more strict and even humans copying a style becomes illegal.

I’m sympathetic to the reflexive impulse to defend OpenAI out of a fear that this whole thing results in even worse copyright law.

I, too, think copyright law is already smothering the cultural conversation and we’re potentially only a couple of legislative acts away from having “property of Disney” emblazoned on our eyeballs.

But don’t fall into their trap of seeing everything through the lens of copyright!

We have other laws!

We can attack OpenAI on antitrust, likeness rights, libel, privacy, and labor laws.

Being critical of OpenAI doesn’t have to mean siding with the big IP bosses. Don’t accept that framing.

Well that all doesn't matter much. If AI is used to cause harm, it should be regulated. If that frustrates you then go get the laws changed that allow shitty companies to ruin good ideas.

Disagree. These companies are exploiting an unfair power dynamic they created that people can't say no to, to make an ungodly amount of money for themselves without compensating people whose data they took without telling them. They are not creating a cool creative project that collaboratively comments on or remixes what other people have made, they are seeking to gobble up and render irrelevant everything that they can, for short term greed. That's not the scenario these laws were made for. AI hurts people who have already been exploited and industries that have already been decimated. Copyright laws were not written with this kind of thing in mind. There are potentially cool and ethical uses for AI models, but OpenAI and Google are just greed machines.

Edited * THRICE because spelling. oof.

Generative AI is not 'influenced' by other people's work the way humans are. A human musician might spend years covering songs they like and copying or emulating the style, until they find their own style, which may or may not be a blend of their influences, but crucially, they will usually add something. AI does not do that. The idea that AI functions the same as human artists, by absorbing influences and producing their own result, is not only fundamentally false, it is dangerously misleading. To portray it as 'not unethical' is even more misleading.

Production AI is highly tuned by training data selection and human feedback. Every model has its own style that many people helped tune. In the open model world there are thousands of different models targeting various styles. Waifu Diffusion and GPT-4chan, for example.

Sure, training data selection impacts the output. If you feed an AI nothing but anime, the images it produces will look like anime. If all it knows is K-pop, then the music it puts out will sound like K-pop. Tweaking a computational process through selective input is not the same as a human being actively absorbing stimuli and forming their own, unique response.

AI doesn't have an innate taste or feeling for what it likes. It won't walk into a second hand CD store, browse the boxes, find something that's intriguing and check it out. It won't go for a walk and think "I want to take a photo of that tree there in the open field". It won't see or hear a piece of art and think "I'd like to learn how to paint/write/play an instrument like that". And it will never make art for the sake of making art, for the pure enjoyment that is the process of creating something, irrespective of who wants to see or hear the result. All it is designed to do is regurgitate an intersection of what it knows that best suits the parameters of a given request (aka prompt). Actively learning, experimenting, practicing techniques, trying to emulate specific techniques of someone else - making art for the sake of making art - is a key component to humans learning from others and being influenced by others.

So the process of human learning and influencing, and the selective feeding of data to an AI to 'tune' its output are entirely different things that cannot and should not be compared.

As others have said, it isn't always inspired, sometimes it literally just copies stuff.

This feels like it was written by someone who invested their money in AI companies because they're worried about their stocks

Sometimes I've noticed Google's AI overview is a nearly word for word copy of the highest reddit result, or any result.

I mean, that's because Google's AI Overview is designed to summarize search results on a topic. On one hand that reduces the degree to which it will simply hallucinate, on the other sometimes the top search result is already as concise as it can be at the target grade level of writing.

God it's so useless

The only times I've had it be remotely helpful is when you want something specific that's going to appear near the top of search results and is also likely to be buried in a bunch of irrelevant faff. Which is to say that occasionally "search for X and summarize the top result" is a useful tool but not often enough for them to front and center it like they do.

For example recipes. You can't copyright a recipe, so recipes tend to be buried in a lot of crap that isn't the actual recipe.

You know, those obsessed with pushing AI would do a lot better if they dropped the patronizing tone in every single one of their comments defending them.

It's always fun reading "but you just don't understand".

On the other hand, it's hard to have a serious discussion with people who insist that building an LLM or diffusion model amounts to copying pieces of material into an obfuscated database. And then, after an explanation is attempted, having to deal with the typical reply of "that isn't the point!" without any elaboration strongly implies to me that some people just want to be pissy and don't want to hear how they may have been manipulated into taking a pro-corporate, hyper-capitalist position on something.

I love that the collectivist ideal of sharing all that we've created for the betterment of humanity is being twisted into this disgusting display of corporate greed and overreach. OpenAI doesn't need shit. They don't have an inherent right to exist but must constantly make the case for their existence.

The bottom line is that if corporations need data that they themselves cannot create in order to build and sell a service then they must pay for it. One way or another.

I see this all as parallels with how aquifers and water rights have been handled and I'd argue we've fucked that up as well.

Training data IS a massive industry already. You don't see it because you probably don't work in a field directly dealing with it. I work in medtech and millions and millions of dollars are spent acquiring training data every year. Should some new unique IP right be found on using otherwise legally rendered data to train AI, it is almost certainly going to be contracted away to hosting platforms via totally sound ToS and then further monetized such that only large and well-funded corporate entities can utilize it.

unique

"unique new IP right?" Bruh you're talking about basic fucking intellectual property law. Just because someone posts something publicly on the internet doesn't mean that it can be used for whatever anybody likes. This is so well-established, that every major art gallery and social media website has a clause in their terms of service stating that you are granting them a license to redistribute that content. And most websites also explicitly state that when you upload your work to their site that you still retain your copyright of that work.

For example (emphasis mine):

FurAffinity:

4.1 When you upload content to Fur Affinity via our services, you grant us a non-exclusive, worldwide, royalty-free, sublicensable, transferable right and license to use, host, store, cache, reproduce, publish, display (publicly or otherwise), perform (publicly or otherwise), distribute, transmit, modify, adapt, and create derivative works of, that content. These permissions are purely for the limited purposes of allowing us to provide our services in accordance with their functionality (hosting and display), improve them, and develop new services. These permissions do not transfer the rights of your content or allow us to create any deviations of that content outside the aforementioned purposes.

Inkbunny:

Posting Content

You keep copyright of any content posted to Inkbunny. For us to provide these services to you, you grant Inkbunny non-exclusive, royalty-free license to use and archive your artwork in accordance with this agreement.

When you submit artwork or other content to Inkbunny, you represent and warrant that:

* you own copyright to the content, or that you have permission to use the content, and that you have the right to display, reproduce and sell the content. You license Inkbunny to use the content in accordance with this agreement;

DeviantArt:

  1. Copyright in Your Content

DeviantArt does not claim ownership rights in Your Content. For the sole purpose of enabling us to make your Content available through the Service, you grant DeviantArt a non-exclusive, royalty-free license to reproduce, distribute, re-format, store, prepare derivative works based on, and publicly display and perform Your Content. Please note that when you upload Content, third parties will be able to copy, distribute and display your Content using readily available tools on their computers for this purpose although other than by linking to your Content on DeviantArt any use by a third party of your Content could violate paragraph 4 of these Terms and Conditions unless the third party receives permission from you by license.

e621:

When you upload content to e621 via our services, you grant us a non-exclusive, worldwide, royalty-free, sublicensable, transferable right and license to use, host, store, cache, reproduce, publish, display (publicly or otherwise), perform (publicly or otherwise), distribute, transmit, downsample, convert, adapt, and create derivative works of, that content. These permissions are purely for the limited purposes of allowing us to provide our services in accordance with their functionality (hosting and display), improve them, and develop new services. These permissions do not transfer the rights of your content or allow us to create any deviations of that content outside the aforementioned purposes.

Xitter:

Your Rights and Grant of Rights in the Content

You retain your rights to any Content you submit, post or display on or through the Services. What’s yours is yours — you own your Content (and your incorporated audio, photos and videos are considered part of the Content).

By submitting, posting or displaying Content on or through the Services, you grant us a worldwide, non-exclusive, royalty-free license (with the right to sublicense) to use, copy, reproduce, process, adapt, modify, publish, transmit, display and distribute such Content in any and all media or distribution methods now known or later developed (for clarity, these rights include, for example, curating, transforming, and translating). This license authorizes us to make your Content available to the rest of the world and to let others do the same.

Facebook:

The permissions you give us We need certain permissions from you to provide our services:

  • Permission to use content you create and share: Some content that you share or upload, such as photos or videos, may be protected by intellectual property laws.

  • You retain ownership of the intellectual property rights (things like copyright or trademarks) in any such content that you create and share on Facebook and other Meta Company Products you use. Nothing in these Terms takes away the rights you have to your own content. You are free to share your content with anyone else, wherever you want.

  • However, to provide our services we need you to give us some legal permissions (known as a "license") to use this content. This is solely for the purposes of providing and improving our Products and services as described in Section 1 above.

  • Specifically, when you share, post, or upload content that is covered by intellectual property rights on or in connection with our Products, you grant us a non-exclusive, transferable, sub-licensable, royalty-free, and worldwide license to host, use, distribute, modify, run, copy, publicly perform or display, translate, and create derivative works of your content (consistent with your privacy and application settings). This means, for example, that if you share a photo on Facebook, you give us permission to store, copy, and share it with others (again, consistent with your settings) such as Meta Products or service providers that support those products and services. This license will end when your content is deleted from our systems.

I could go on, but I think I've made my point very clear: Every social media website and art gallery is built on an assumption that the person uploading art A) retains the copyright over the items they upload, B) that other people and organizations have NO rights to copyrighted works unless explicitly stated otherwise, and C) that 3rd parties accessing this material do not have any rights to uploaded works, since they never negotiated a license to use these works.

You are misunderstanding what I'm getting at and unfortunately no this isn't just straightforwardly copyright law whatsoever. The training content does not need to be copied. It isn't saved in a database somewhere (as part of the training....downloading pirated texts is a whole other issue completely removed from the inherent processes of training a model), relationships are extracted from the material, however it is presented. So the copyright extends to the right of displaying the material in the first place. If your initial display/access to the training content is non-infringing, the mere extraction of relationships between components is not itself making a copy nor is it making a derivative work in any way we haven't historically considered it. Effectively, it's the difference between looking at material and making intensive notes of how different parts of the material relate to each other and looking at a material and reproducing as much of it as possible for your own records.

FFS, the issue is not that the AI model "copies" the copyrighted works when it trains on them--I agree that after an AI model is trained, it does not meaningfully retain the copyrighted work. The problem is that the reproduction of the copyrighted work--i.e. downloading the work to the computer, and then using that reproduction as part of AI model training--is being done for a commercial purpose that infringes copyright.

If I went to DeviantArt and downloaded a random piece of art to my hard drive for my own personal enjoyment, that is a non-infringing reproduction. If I then took that same piece of art, and uploaded it to a service that prints it on a T-shirt, the act of uploading it to the T-shirt printing service's server would be infringing, since it is no longer being reproduced for personal enjoyment, but the unlawful reproduction of copyrighted material for commercial purpose. Similarly, if I downloaded a piece of art and used it to print my own T-shirts for sale, using all my own computers and equipment, that would also be infringing. This is straightforward, non-controversial copyright law.

The exact same logic applies to AI training. You can try to camouflage the infringement with flowery language like "mere extraction of relationships between components," but the purpose and intent behind AI companies reproducing copyrighted works via web scraping and downloading copyrighted data to their servers is to build and provide a commercial, for-profit service that is designed to replace the people whose work is being infringed. Full stop.

No, this is mostly incorrect, sorry. The commercial aspect of the reproduction is not relevant to whether it is an infringement--it is simply a factor in damages and Fair Use defense (an affirmative defense that presupposes infringement).

What you are getting at when it applies to this particular type of AI is effectively whether it would be a fair use, presupposing there is copying amounting to copyright infringement. And what I am saying is that, ignoring certain stupid behavior like torrenting a shit ton of text to keep a local store of training data, there is no copying happening as a matter of necessity. There may be copying as a matter of stupidity, but it isn't necessary to the way the technology works.

Now, I know, you're raging and swearing right now because you think that downloading the data into cache constitutes an unlawful copying--but it presumably does not if it is accessed like any other content on the internet. Because intent is not a part of what makes that a lawful or unlawful copying and once a lawful distribution is made, principles of exhaustion begin to kick in and we start getting into really nuanced areas of IP law that I don't feel like delving into with my thumbs, but ultimately the point is that it isn't "basic copyright law." But if intent is determinative of whether there is copying in the first place, how does that jibe with an actor not making copies for themselves but rather accessing retained data in a third party's cache after they grab the data for noncommercial purposes? Also, how does that make sense if the model is being trained for purely research purposes? And then perhaps that model is leveraged commercially after development? Your analysis, assuming it's correct arguendo, leaves far too many outstanding substantive issues to be the ruling approach.

EDIT: also, if you download images from deviantart with the purpose of using them to make shirts or for some other commercial endeavor, that has no bearing on whether the download was infringing. Presumably, you downloaded via the tools provided by DA. The infringement happens when you reproduce the images for the commercial purpose (though any redistribution is actually infringing).

The commercial aspect of the reproduction is not relevant to whether it is an infringement–it is simply a factor in damages and Fair Use defense (an affirmative defense that presupposes infringement).

What you are getting at when it applies to this particular type of AI is effectively whether it would be a fair use, presupposing there is copying amounting to copyright infringement. And what I am saying is that, ignoring certain stupid behavior like torrenting a shit ton of text to keep a local store of training data, there is no copying happening as a matter of necessity. There may be copying as a matter of stupidity, but it isn’t necessary to the way the technology works.

You're conflating whether something is infringement with defenses against infringement. Believe it or not, basically all data transfer and display of copyrighted material on the Internet is technically infringing. That includes the download of a picture to your computer's memory for the sole purpose of displaying it on your monitor. In practice, nobody ever bothers suing art galleries, social media websites, or web browsers, because they all have ironclad defenses against infringement claims: art galleries & social media include a clause in their TOS that grants them a license to redistribute your work for the purpose of displaying it on their website, and web browsers have a basically bulletproof fair use claim. There are other non-infringing uses such as those which qualify for a compulsory license (e.g. live music productions, usually involving royalties), but they're largely not very relevant here. In any case, the fundamental point is that any reproduction of a copyrighted work is infringement, but there are varied defenses against infringement claims that mean most infringing activities never see a courtroom in practice.

All this gets back to the original point I made: Creators retain their copyright even when uploading data for public use, and that copyright comes with heavy restrictions on how third parties may use it. When an individual uploads something to an art website, the website is free and clear of any claims for copyright infringement by virtue of the license granted to it by the website's TOS. In contrast, an uninvolved third party--e.g. a non-registered user or an organization that has not entered into a licensing agreement with the creator or the website (*cough* OpenAI)--has no special defense against copyright infringement claims beyond the baseline questions: was the infringement for personal, noncommercial use? And does the infringement qualify as fair use? Individual users downloading an image for their private collection are mostly A-OK, because the infringement is done for personal & noncommercial use--theoretically someone could sue over it, but there would have to be a lot of aggravating factors for it to get beyond summary judgment. AI companies using web scrapers to download creators' works do not qualify as personal/noncommercial use, for what I hope are bloody obvious reasons.

As for a model trained purely for research or educational purposes, I believe that it would have a very strong claim for fair use as long as the model is not widely available for public use. Once that model becomes publicly available, and/or is leveraged commercially, the analysis changes, because the model is no longer being used for research, but for commercial profit. To apply it to the real world, when OpenAI originally trained ChatGPT for research, it was on strong legal ground, but when it decided to start making it publicly available, it should have thrown out its training dataset and built up a new one using data in the public domain and data that it had negotiated a license for, trained ChatGPT on the new dataset, and then released it commercially. If it had done that, and if individuals had been given the option to opt their creative works out of this dataset, I highly doubt that most people would have any objection to LLMs from a legal standpoint. Hell, it probably could have gotten licenses to use most websites' data to train ChatGPT for a song. Instead, it jumped the gun and tipped its hand before it had all its ducks in a row, and now everybody sees just how valuable their data is to OpenAI and is pricing it accordingly.

Oh, and as for your edit, you contradicted yourself: in your first line, you said "The commercial aspect of the reproduction is not relevant to whether it is an infringement." In your edit, you said "the infringement happens when you reproduce the images for a commercial purpose." So which is it? (To be clear, the initial download is infringing copyright both when I download the image for personal/noncommercial use, and also when I download it to make T-shirts with. The difference is that the first case has a strong defense against an infringement claim that would likely get it dismissed in summary, while the cases of making T-shirts would be straightforward claims of infringement.)

Like I've said, you are arguing this into nuanced aspects of copyright law that are absolutely not basic, but I do not agree at all with your assessment of the initial reproduction of the image in a computer's memory. First, to be clear, what you are arguing is that images on a website are licensed to the host to be reproduced for non-commercial purposes only and that such downstream access may only be non-commercial (defined very broadly--there is absolutely a strong argument here that commercial activity in this situation means direct commercial use of the reproduction; for example, you wouldn't say that a user who gets paid to look at images is commercially using the accessed images) or it violates the license. Now, even ignoring my parentheses, there are contract law and copyright law issues with this. Again, using thumbs and, honestly, I'm not trying to write a legal brief as a result of a random reply on lemmy, but the crux is that it is questionable whether you can enforce licensing terms that are presented to a licensee AFTER you enable, if not force, them to perform the act of copying your work. Effectively, you allowed them to make a copy of the work, and then you are trying to say "actually, you can only do x, y, and z with that particular copy." This is also where exhaustion rears its head, once you add on your position that when a trained model switches from non-commercial deployment to commercial deployment it can suddenly retroactively recharacterize the initial use as unlicensed infringement. Logistically, it just doesn't make sense either (for example, what happens when a further downstream user commercializes the model? Does that percolate back to recharacterize the original use? What about downstream from that? How deep into a toolchain history do you need to go to break time traveling egregious breach of exhaustion?) so I have a hard time accepting it.

Now, in response to your query wrt my edit, my point was that infringement happens when you do the further downstream reproduction of the image. When you print a unicorn on a t-shirt, it's that printing that is the infringement. The commercial aspect has absolutely no bearing on whether an infringement occurs. It is relevant to damages and the fair use affirmative defense. The sole query of whether infringement has occurred is whether a copy has been made and thus violated the copyright.

And all this is just about whether there is even a copying at the training of the models stage. This doesn't get into a fairly challenging fair use analysis (going by SCotUS' reasoning on copyrightability of API in Oracle v Google, I actually think the fair use defense is very strong, but I also don't think there is an infringement happening to even necessitate such an analysis so ymmv--also, that decision was terrible and literally every time the SCotUS has touched IP issues, it has made the law wildly worse and more expensive and time-consuming to deal with). It also doesn't get into whether outputs that are very similar to works infringe in the way music does (even though there is no actual copying--I think it highly likely it is an infringement). It also also doesn't get into how outputs might infringe even though there is no IP rights in the outputs of a generative architecture (this probably is more a weird academic issue but I like it nonetheless). Oh, and likeness rights haven't made their way into the discussion (and the incredible weirdness of a class action that includes right of publicity among its claims).

We can, and probably will, disagree on how IP law works here. That's cool. I'm not trying to litigate it on lemmy. My point in my replies at this point is just to show that it is not "basic copyright law bruh". The copyright law, and all the IP law really, around generative AI techniques is fairly complicated and nuanced. It's totally reasonable to hold the position that our current IP laws do not really address this the way most seem to want it to. In fact, most other IP attorneys I've talked to with an understanding of the technical processes at hand seem to agree. And, again, I don't think that further assetizing intangibles into a "right to extract machine learning from" is a viable path forward in the mid and long run, nor one that benefits anyone but highly monied corporate actors either.

I don't get your comment, are you pro-corporate, for AI, or against it?

I have no personal interest in the matter, tbh. But I want people to actually understand what they're advocating for and what the downstream effects would inevitably be. Model training is not inherently infringing activity under current IP law. It just isn't. Neither the law, legislative or judicial, nor the actual engineering and operations of these current models support at all a finding of infringement. Effectively, this means that new legislation needs to be made to handle the issue. Most are effectively advocating for an entirely new IP right in the form of a "right to learn from" which further assetizes ideas and intangibles such that we get further shuffled into endstage capitalism, which most advocates are also presumably against.

I'm pretty sure most people are just mad that this is basically "rules for thee but not for me", why should a company be free to pirate but I can't? Case in point is the internet archive losing their case against a publisher. That's the crux of the issue.

I get that that's how it feels given how it's being reported, but the reality is that due to the way this sort of ML works, what internet archive does and what an arbitrary GPT does are completely different, with the former being an explicit and straightforward copy relying on Fair Use defense and the latter being the industrialized version of intensive note taking into a notebook full of such notes while reading a book. That the outputs of such models are totally devoid of IP protections actually makes a pretty big difference imo in their usefulness to the entities we're most concerned about, but that certainly doesn't address the economic dilemma of putting an entire sector of labor at risk in narrow areas.

Those claiming AI training on copyrighted works is "theft" misunderstand key aspects of copyright law and AI technology.

Or maybe they're not talking about copyright law. They're talking about basic concepts. Maybe copyright law needs to be brought into the 21st century?

Considering that original works are discarded, it's strange how effective they are at plagiarizing them.

Yep, it's definitely not possible that nice small businesses like Universal and Sony would sue without an actual case in order to try and crush competitors with costs.

In the same way that a person can learn the material and also use that knowledge to potentially plagiarize it, though. It's no different in that sense. What is different is the speed of learning and both the speed and capacity of recall. However, it doesn't change the fundamental truths of OP's explanation.

Also, when you're talking specifically about music, you're talking about a very limited subset of note combinations that will sound pleasing to human ears. Additionally, even human composers commonly struggle to not simply accidentally reproduce others' work, which is partly why the music industry is filled with constant copyright litigation.

I thought the larger point was that they're using plenty of sources that do not lie in the public domain. Like if I download a textbook to read for a class instead of buying it - I could be prosecuted for stealing. And they've downloaded and read millions of books without paying for them.

And they've downloaded and read millions of books without paying for them.

Do you have a source on that?

Though I am not a lawyer by training, I have been involved in such debates personally and professionally for many years. This post is unfortunately misguided. Copyright law makes concessions for education and creativity, including criticism and satire, because we recognize the value of such activities for human development. Debates over the excesses of copyright in the digital age were specifically about humans finding the application of copyright to the internet and all things digital too restrictive for their educational, creative, and yes, also their entertainment needs. So any anti-copyright arguments back then were in the spirit specifically of protecting the average person and public-interest non-profit institutions, such as digital archives and libraries, from big copyright owners who would sue and lobby for total control over every file in their catalogue, sometimes in the process severely limiting human potential.

AI’s ingesting of text and other formats is “learning” in name only, a term borrowed by computer scientists to describe a purely computational process. It does not hold the same value socially or morally as the learning that humans require to function and progress individually and collectively.

AI is not a person (unless we get definitive proof of a conscious AI, or are willing to grant every implementation of a statistical model personhood). Also, AI is not vital to human development and as such, one could argue, does not need special protections or special treatment to flourish. AI is a product, even more clearly so when it is proprietary and sold as a service.

Unlike past debates over copyright, this is not about protecting the little guy or organizations with a social mission from big corporate interests. It is the opposite. It is about big corporate interests turning human knowledge and creativity into a product they can then use to sell services to - and often to replace in their jobs - the very humans whose content they have ingested.

See, the tables are now turned and it is time to realize that copyright law, for all its faults, has never been only or primarily about protecting large copyright holders. It is also about protecting your average Joe from unauthorized uses of their work. More specifically uses that may cause damage, to the copyright owner or society at large. While a very imperfect mechanism, it is there for a reason, and its application need not be the end of AI. There’s a mechanism for individual copyright owners to grant rights to specific uses: it’s called licensing and should be mandatory in my view for the development of proprietary LLMs at least.

TL;DR: AI is not human, it is a product, one that may augment some tasks productively, but is also often aimed at replacing humans in their jobs - this makes all the difference in how we should balance rights and protections in law.

Studied AI at uni. I'm also a cyber security professional. AI can be hacked or tricked into exposing training data. Therefore your claim about it disposing of the training material is totally wrong.

Ask your search engine of choice what happened when Gippity was asked to print the word "book" indefinitely. Answer: it printed training material after printing the word book a couple hundred times.

Also my main tutor in uni was a neuroscientist. Dude straight up told us that the current AI was only capable of accurately modelling something as complex as a dragon fly. For larger organisms it is nowhere near an accurate recreation of a brain. There are complexities in our brain chemistry that simply aren't accounted for in a statistical inference model and definitely not in the current gpt models.

That knowledge is out of date and out of touch. While it's possible to expose small bits of training data, that's akin to someone being able to recall a portion of the memory of a scene they saw. However, those exercises essentially relied on interrogation methods, refined over what sometimes equates to weeks or months, employed by people looking to target specific types of responses. Think of it like a skilled police interrogator tricking a toddler out of one of their toys by threatening them or offering them something until it worked. Nowadays, that's getting far more difficult to do and they're spending a lot more time and expertise to do it.

Also, consider how complex a dragonfly is and how young this technology is. Very little in tech has ever progressed that fast. Give it five more years and come back to laugh at how naive your comment will seem.

Dammit, so my comment to the other person was a mix of a reply to this one and the last one... not having a good day for language processing, ironically.

Specifically on the dragonfly thing, I don't think I'll believe myself naive for writing that post or this one. Dragonflies aren't very complex and only really have a few behaviours and inputs. We can accurately predict how they will fly. I brought up the dragonfly to mention the limitations of the current tech and concepts. Given the world's computing power and research investment, the best we can do is a dragonfly for intelligence.

To be fair, scientists don't entirely understand neurons, and ML-designed neuron data structures behave similarly to very early ideas of what brains do, but it's based on concepts from the 1950s. There are different segments of the brain which process different things and we sort of think we know what they all do, but most of the studies AI is based on are honestly outdated neuroscience. OpenAI seem to think if they stuff enough data into this language processor it will become sentient, and want an exemption from copyright law so they can be profitable rather than actually improving the tech concepts and designs.

Newer neuroscience research suggests neurons perform differently based on the brain chemicals present, they don't all always fire at every (or even most) input and they usually present a train of thought, i.e. thoughts literally move around in the brain's areas. This is all very different to current ML implementations and is frankly a good enough reason to suggest the tech has a lot of room to develop. I like the field of research and it's interesting to watch it develop, but they can honestly fuck off telling people they need free access to the world's content.

TL;DR dragonflies aren't that complex and the tech has way more room to grow. However, they have to generate revenue to keep going so they're selling a large inference machine that relies on all of humanity's content to generate the wrong answer to 2+2.

Your first point is misguided and incorrect. If you've ever learned something by 'cramming', a.k.a. just repeatedly ingesting material until you remember it completely, you don't need the book in front of you anymore to write the material down verbatim in a test. You still discarded your training material despite knowing the exact contents. If this was all the AI could do, it would indeed be an infringement machine. But you said it yourself, you need to trick the AI to do this. It's not made to do this, but certain sentences are indeed almost certain to show up with the right conditioning. Which is indeed something anyone using an AI should be aware of, and avoid that kind of conditioning. (Which in practice often just means, don't ask the AI to make something infringing.)

I think you're anthropomorphising the tech tbh. It's not a person or an animal, it's a machine, and cramming doesn't work in the idea of neural networks. They're a mathematical calculation over a vast multidimensional matrix, effectively solving a polynomial of an unimaginable order. So "cramming" as you put it doesn't work, because by definition an LLM cannot forget information: once it has applied the calculations, it is in there forever. That information is supposed to be blended together. Overfitting is the closest thing to what you're describing, which would be inputting similar information (training data) and performing similar calculations throughout the network, and it would therefore exhibit poor performance should it be asked to do anything different to the training.
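For anyone who wants to see the overfitting point in miniature, here is a rough sketch. It is only a stand-in: it uses a plain polynomial instead of a neural network, and the data points, the noise values, and the function names (`lagrange_fit`, `linear_fit`) are all made up for illustration. One model passes exactly through every training point, the other is a simple straight line, and both are then asked about an input they never saw.

```python
# Toy illustration of overfitting. A polynomial stands in for a network here;
# all data and names are hypothetical, chosen only to make the effect visible.

def lagrange_fit(xs, ys):
    """Return the unique polynomial of degree len(xs)-1 through every point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def linear_fit(xs, ys):
    """Ordinary least-squares straight line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

# Assumed ground truth: y = x, observed with a little alternating noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
noise = [0.5, -0.5, 0.5, -0.5, 0.5]
train_y = [x + e for x, e in zip(train_x, noise)]

memoriser = lagrange_fit(train_x, train_y)  # reproduces the training data exactly
simple = linear_fit(train_x, train_y)       # captures only the general trend

# Worst-case error on the data each model was fitted to.
train_err_memoriser = max(abs(memoriser(x) - y) for x, y in zip(train_x, train_y))

# Error on an input outside the training set (true value is 5.0).
unseen_x = 5.0
err_memoriser = abs(memoriser(unseen_x) - unseen_x)
err_simple = abs(simple(unseen_x) - unseen_x)

print("memoriser error on training data:", train_err_memoriser)
print("memoriser error on unseen input: ", err_memoriser)
print("straight-line error on unseen input:", err_simple)
```

The model that reproduces its training data perfectly does far worse on the unseen input than the one that only captured the trend. Whether and how much this carries over to a real LLM is a separate question that this toy does not settle.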

What I'm arguing over here is language rather than a system, so let's do that and note the flaws. If we're being intellectually honest we can agree that a flaw like reproducing large portions of a work doesn't represent true learning and shows a reliance on the training data, i.e. it can't learn unless it has seen similar data before, and certain inputs provide a chance it just parrots back the training data.

In the example (repeat book over and over), it has statistically inferred that those are all the correct words to repeat in that order based on the prompt. This isn't akin to anything human, people can't repeat pages of text verbatim like this and no toddler can be tricked into repeating a random page from a random book as you say. The data is there, it's encoded and referenced when the probability is high enough. As another commenter said, language itself is a powerful tool of rules and stipulations that provide guidelines for the machine, but it isn't crafting its own sentences, it's using everyone else's.
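To make the "the data is encoded" point concrete, here is a deliberately tiny sketch. It is a word-level n-gram counter, which is nothing like the architecture of a real GPT model, and the corpus string and function names (`train`, `generate`) are invented for the example. The table it builds holds only counts of which word followed which context, never the original string, yet greedy generation from the opening words walks straight back through the training text.

```python
from collections import Counter, defaultdict

def train(text, order=2):
    """Count which word follows each `order`-word context.

    Only the counts are kept; the original string itself is not stored.
    """
    words = text.split()
    table = defaultdict(Counter)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        table[context][words[i + order]] += 1
    return table

def generate(table, prompt, max_words=50):
    """Greedy decoding: always append the most frequent next word."""
    out = prompt.split()
    order = len(next(iter(table)))  # context length used during training
    for _ in range(max_words):
        context = tuple(out[-order:])
        if context not in table:
            break  # nothing was ever observed after this context
        out.append(table[context].most_common(1)[0][0])
    return " ".join(out)

# Hypothetical one-sentence "training set".
corpus = "the quick brown fox jumps over the lazy dog near the quiet river bank"
table = train(corpus, order=2)

completion = generate(table, "the quick")
print(completion)
```

With a single short training text every context has exactly one continuation, so the statistics alone are enough to reproduce it word for word. With a large and varied corpus the counts blend many sources together, which is roughly where the two sides of this thread part ways on how much regurgitation to expect.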

Also, calling it "tricking the AI" isn't really intellectually honest either, as in "it was tricked into exposing it still has the data encoded". We can state it isn't preferred or intended behaviour (an exploit of the system) but the system, under certain conditions, exhibits reuse of the training data and the ability to replicate it almost exactly (plagiarism). Therefore it is factually wrong to state that it doesn't keep the training data in a usable format - which was my original point. This isn't "cramming", this is encoding and reusing data that was not created by the machine or the programmer; this is other people's work that it is reproducing as its own. It does this constantly, from reusing StackOverflow code and comments to copying tutorials on how to do things. I was showing a case where it won't even modify the wording, but it reproduces articles and programs in their structure and their format. This isn't originality, creativity or anything that it is marketed as. It is storing, encoding and copying information to reproduce in a slightly different format.

EDITS: Sorry for all the edits. I mildly changed what I said and added some extra points so it was a little more intelligible and didn't make the reader go "WTF is this guy on about". Not doing well in the written department today so this was largely gobbledegook before but hopefully it is a little clearer what I am saying.

I rather think the point is being missed here. Copyright is already causing huge issues, such as the troubles faced by the internet archive, and the fact academics get nothing from their work.

Surely the argument here is that copyright law needs to change, as it acts as a barrier to education and human expression. Not, however, just for AI, but as a whole.

Copyright law needs to move with the times, as all laws do.

Copyright is a lesser evil compared to taking human labor and creativity for free to sell a product.

Not just that, but to sell a product that by its very nature threatens the livelihoods of the same people whose labor and creativity is being used without permission.

I'll train my AI on just the bee movie. Then I'm going to ask it "can you make me a movie about bees"? When it spits the whole movie, I can just watch it or sell it or whatever, it was a creation of my AI, which learned just like any human would! Of course I didn't even pay for the original copy to train my AI, it's for learning purposes, and learning should be a basic human right!

That would be like you writing out the bee movie yourself after memorizing the whole movie and claiming it is your own idea or using it as proof that humans memorizing a movie is violating copyright. Just because an AI is violating copyright by outputting the whole bee movie, it doesn't mean training the AI on copyright stuff is violating copyright.

Let's just punish the AI companies for outputting copyright stuff instead of for training with them. Maybe that way they would actually go out of their way to make their LLM intelligent enough to not spit out copyrighted content.

Or, we can just make it so that any output made by an AI that is trained on copyrighted stuff cannot be copyrighted.

If the solution is making the output non-copyrighted it fixes nothing. You can sell the pirating machine on a subscription. And it's not like Netflix where the content ends when the subscription ends, you have already downloaded all the not-copyrighted content you wanted, and the internet would be full of non-copyrighted AI output.

Instead of selling the bee movie, you sell a bee movie maker, and a spiderman maker, and a titanic maker.

Sure, file a copyright infringement each time you manage to make an AI output copyrighted content. Just run it on a loop and it's a money making machine. That's fine by me.

Yeah, because running the AI also has some cost, so you are selling the subscription to run the AI on their server, not its output.

I'm not sure what is the legality of selling a bee movie maker, so you'd have to research that one yourself.

It's not really a money making machine if you lose more money running the AI on your server farm, but whatever floats your boat. Also, there are already lawsuits based on outputs created from chatgpt, so it is exactly what is already happening.

Yeah, making sandwiches also costs money! I have to pay my sandwich making employees to keep the business profitable! How do they expect me to pay for the cheese?

EDIT: also, you completely missed my point. The money making machine is the AI because the copyright owners could just sue them every time it produces copyright-protected material if we decided to take that route, which is what the parent comment suggested.

They should pay for the cheese, I'm not arguing against that, but they should be paying it the same amount as a normal human would if they want access to that cheese. No extra fees for access to copyrighted material if you want to use it to train AI vs wanting to consume it yourself.

And I didn't miss your point. My point was that the reality is already occurring since people are already suing OpenAI for ChatGPT outputs that the people suing are generating themselves, so it's no longer just a hypothetical. We'll see if it is a money making machine for them or will they just waste their resources from doing that.

Media is not exactly like cheese though. With cheese, you buy it and it's yours. Media, however, is protected by copyright. When you watch a movie, you are given a license to watch the movie.

When an AI watches a movie, it's not really watching it, it's doing a different action. If the license of the movie says "you can't use this license to train AI, use the other (more expensive) license for such purposes", then AIs have extra fees to access the content that humans don't have to pay.

Both humans and AI consume the content, even if they do not do so in the exact same way. I don't see the need to differentiate that. It's not like we have any idea of the mechanism by which humans consume content to make the differentiation in the first place.

Don't need to get philosophical about what is the difference between human and AI learning.

"Consumed by AI" and "consumed by a human" are two distinct use cases that can have different terms in a license.

Why do we need to differentiate those two use cases, anyway? It's not like they differentiate between a single human or multiple humans consuming the content, or if there are non-humans also consuming it. Differentiating those two use cases is just another example of publishers wanting more money due to greed. I'm not sure why Lemmy is so supportive of that.

We need to differentiate between those cases because they are 2 distinct cases. And they are very different.

They don't even have the same purpose. The purpose of a human learning is: fulfill a desire to learn, or acquire a new skill that will be useful to fulfill another desire. The purpose of AI learning is: increase the value of the model so it can be sold for more.

Lemmy is not an entity that is capable of thought. And I'm not Lemmy. I'm just another person and what you are reading is my opinion.

"Publishers are bad and greedy, therefore everything that hurts them is good for society" is a childish take imo. Not everything is black and white. Copyright exists for a reason. Just removing it won't make the world better. A law being flawed doesn't make it worse than not existing.

Bullshit. AI are not human. We shouldn't treat them as such. AI are not creative. They just regurgitate what they are trained on. We call what it does "learning", but that doesn't mean we should elevate what they do to be legally equal to human learning.

It's this same kind of twisted logic that makes people think Corporations are People.

Ok, ignore this specific company and technology.

In the abstract, if you wanted to make artificial intelligence, how would you do it without using the training data that we humans use to train our own intelligence?

We learn by reading copyrighted material. Do we pay for it? Sometimes. Sometimes a teacher read it a while ago and then just regurgitated basically the same copyrighted information back to us in a slightly changed form.

The thing is, they can have scads of free stuff that is not copyrighted. But they are greedy and want copyrighted stuff, too.

We all should. Copyright is fucking horseshit.

It costs literally nothing to make a digital copy of something. There is ZERO reason to restrict access to things.

Making a copy is free. Making the original is not. I don't expect a professional photographer to hand out their work for free because making copies of it costs nothing. You're not paying for the copy, you're paying for the money and effort needed to create the original.

Making a copy is free. Making the original is not.

Yes, exactly. Do you see how that is different from the world of physical objects and energy? That is not the case for a physical object. Even once you design something and build a factory to produce it, the first item off the line takes the same amount of resources as the last one.

Capitalism is based on the idea that things are scarce. If I have something, you can't have it, and if you want it, then I have to give up my thing, so we end up trading. Information does not work that way. We can freely copy a piece of information as much as we want. Which is why monopolies and capitalism are a bad system of rewarding creators. They inherently cause us to impose scarcity where there is no need for it, because in capitalism things that are abundant do not have value. Capitalism fundamentally fails to function when there is abundance of resources, which is why copyright was a dumb system for the digital age. Rather than recognize that we now live in an age of information abundance, we spend billions of dollars trying to impose artificial scarcity.

You sound like someone who has not tried to make an artistic creation for profit.

You sound like someone unwilling to think about a better system.

Better system for WHOM? Tech-bros that want to steal my content as their own?

I'm a writer, performing artist, designer, and illustrator. I have thought about copyright quite a bit. I have released some of my stuff into the public domain, as well as the Creative Commons. If you want to use my work, you may - according to the licenses that I provide.

I also think copyright law is way out of whack. It should go back to - at most - life of author. This "life of author plus 70 years" is ridiculous. I lament that so much great work is being lost or forgotten because of the oppressive copyright laws - especially in the area of computer software.

But tech-bros that want my work to train their LLMs - they can fuck right off. There are legal thresholds that constitute "fair use" - Is it used for an academic purpose? Is it used for a non-profit use? Is the portion that is being used a small part or the whole thing? LLM software fail all of these tests.

They can slurp up the entirety of Wikipedia, and they do. But they are not satisfied with the free stuff. But they want my artistic creations, too, without asking. And they want to sell something based on my work, making money off of my work, without asking.

Better system for WHOM? Tech-bros that want to steal my content as their own?

A better system for EVERYONE. One where we all have access to all creative works, rather than spending billions on engineers and lawyers to create walled gardens and DRM and artificial scarcity. What if literally all the money we spent on all of that instead went to artist royalties?

But tech-bros that want my work to train their LLMs - they can fuck right off. There are legal thresholds that constitute “fair use” - Is it used for an academic purpose? Is it used for a non-profit use? Is the portion that is being used a small part or the whole thing? LLM software fail all of these tests.

No. It doesn't.

They can literally pass all of those tests.

You are confusing OpenAI keeping their LLM closed source and charging for access to it with LLMs in general. The open source models that Microsoft and Meta publish, for instance, pass literally all of the criteria you just stated.

They literally do not pass the criteria. LLMs use the entirety of a copyrighted work for their training, which fails the "amount and substantiality" factor. By their very nature, LLMs would significantly devalue the work of every artist, author, journalist, and publishing organization, on an industry-wide scale, which fails the "Effect upon work's value" factor.

Those two alone would be enough for any sane judge to rule that training LLMs would not qualify as fair use, but then you also have OpenAI and other commercial AI companies offering the use of these models for commercial, for-profit purposes, which also fails the "Purpose and character of the use" factor. You could maybe argue that training LLMs is transformative, but the commercial, widespread nature of this infringement would weigh heavily against that. So that's at least two, and arguably three out of four factors where it falls short.

LLMs use the entirety of a copyrighted work for their training, which fails the "amount and substantiality" factor.

That factor is relative to what is reproduced, not to what is ingested. A company is allowed to scrape the web all they want as long as they don't republish it.

By their very nature, LLMs would significantly devalue the work of every artist, author, journalist, and publishing organization, on an industry-wide scale, which fails the "Effect upon work's value" factor.

I would argue that LLMs devalue the author's potential for future work, not the original work they were trained on.

Those two alone would be enough for any sane judge to rule that training LLMs would not qualify as fair use, but then you also have OpenAI and other commercial AI companies offering the use of these models for commercial, for-profit purposes, which also fails the "Purpose and character of the use" factor.

Again, that's the practice of OpenAI, but not inherent to LLMs.

You could maybe argue that training LLMs is transformative,

It's honestly absurd to try and argue that they're not transformative.

That factor is relative to what is reproduced, not to what is ingested. A company is allowed to scrape the web all they want as long as they don’t republish it.

The work is reproduced in full when it's downloaded to the server used to train the AI model, and the entirety of the reproduced work is used for training. Thus, they are using the entirety of the work.

I would argue that LLMs devalue the author’s potential for future work, not the original work they were trained on.

And that makes it better somehow? Aereo got sued out of existence because their model threatened the retransmission fees that broadcast TV stations were being paid by cable TV subscribers. There wasn't any devaluation of broadcasters' previous performances, the entire harm they presented was in terms of lost revenue in the future. But hey, thanks for agreeing with me?

Again, that’s the practice of OpenAI, but not inherent to LLMs.

And again, LLM training so egregiously fails two out of the four factors for judging a fair use claim that it would fail the test entirely. The only difference is that OpenAI is failing it worse than other LLMs.

It’s honestly absurd to try and argue that they’re not transformative.

It's even more absurd to claim something that is transformative automatically qualifies for fair use.

The work is reproduced in full when it’s downloaded to the server used to train the AI model, and the entirety of the reproduced work is used for training. Thus, they are using the entirety of the work.

That's objectively false. It's downloaded to the server, but it should never be redistributed to anyone else in full. As a developer for instance, it's illegal for me to copy code I find in a Medium article and use it in our software. I'm perfectly allowed to read that Medium article, learn from it, and then write my own similar code.

And that makes it better somehow? Aereo got sued out of existence because their model threatened the retransmission fees that broadcast TV stations were being paid by cable TV subscribers. There wasn’t any devaluation of broadcasters’ previous performances, the entire harm they presented was in terms of lost revenue in the future. But hey, thanks for agreeing with me?

And Aereo should not have lost that suit. That's an example of the US court system abjectly failing.

And again, LLM training so egregiously fails two out of the four factors for judging a fair use claim that it would fail the test entirely. The only difference is that OpenAI is failing it worse than other LLMs.

That's what we're debating, not a given.

It’s even more absurd to claim something that is transformative automatically qualifies for fair use.

Fair point, but it is objectively transformative.

We learn by reading copyrighted material.

We are human beings. The comparison is false on its face because what you all are calling AI isn't in any conceivable way comparable to the complexity and versatility of a human mind, yet you continue to spit this lie out, over and over again, trying to play it up like it's Data from Star Trek.

This model isn't "learning" anything in any way that is even remotely like how humans learn. You are deliberately simplifying the complexity of the human brain to make that comparison.

Moreover, human beings make their own choices, they aren't actual tools.

They pointed a tool at copyrighted works and told it to copy, do some math, and regurgitate it. What the AI "does" is not relevant, what the people that programmed it told it to do with that copyrighted information is what matters.

There is no intelligence here except theirs. There is no intent here except theirs.

This model isn’t “learning” anything in any way that is even remotely like how humans learn. You are deliberately simplifying the complexity of the human brain to make that comparison.

I do think the complexity of artificial neural networks is overstated. A real neuron is a lot more complex than an artificial one, and real neurons are not simply feed forward like ANNs (which have to be because they are trained using back-propagation), but instead have their own spontaneous activity (which kinda implies that real neural networks don't learn using stochastic gradient descent with back-propagation). But to say that there's nothing at all comparable between the way humans learn and the way ANNs learn is wrong IMO.
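
For anyone who hasn't seen what "an artificial neuron trained with stochastic gradient descent" boils down to, here is a minimal sketch. It is a single sigmoid unit learning the OR function, so there are no layers to propagate through and the update is just the chain-rule step that back-propagation repeats layer by layer. The function names, learning rate, and epoch count are illustrative choices of mine, not anything from a real framework.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(samples, epochs=2000, lr=0.5, seed=0):
    """Train one sigmoid neuron with stochastic gradient descent.

    samples: list of ((x1, x2), target) pairs with targets 0 or 1.
    Hyperparameters are arbitrary illustrative values.
    """
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)]
    b = 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)  # "stochastic": visit samples in random order
        for (x1, x2), target in data:
            y = sigmoid(w[0] * x1 + w[1] * x2 + b)  # forward pass
            # For cross-entropy loss, the gradient with respect to the
            # pre-activation is simply (prediction - target).
            err = y - target
            w[0] -= lr * err * x1
            w[1] -= lr * err * x2
            b -= lr * err
    return w, b

def predict(w, b, x):
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train_neuron(OR_DATA)
```

That is the whole "neuron": a weighted sum, a squashing function, and a nudge to the weights after every example. Nothing in it has spontaneous activity of its own, which is the contrast with real neurons.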

If you read books such as V.S. Ramachandran and Sandra Blakeslee's Phantoms in the Brain or Oliver Sacks' The Man Who Mistook His Wife For a Hat you will see lots of descriptions of patients with anosognosia brought on by brain injury. These are people who, for example, are unable to see but also incapable of recognizing this inability. If you ask them to describe what they see in front of them they will make something up on the spot (in a process called confabulation) and not realize they've done it. They'll tell you what they've made up while believing that they're telling the truth. (Vision is just one example, anosognosia can manifest in many different cognitive domains).

It is V.S Ramachandran's belief that there are two processes that occur in the Brain, a confabulator (or "yes man" so to speak) and an anomaly detector (or "critic"). The yes-man's job is to offer up explanations for sensory input that fit within the existing mental model of the world, whereas the critic's job is to advocate for changing the world-model to fit the sensory input. In patients with anosognosia something has gone wrong in the connection between the critic and the yes man in a particular cognitive domain, and as a result the yes-man is the only one doing any work. Even in a healthy brain you can see the effects of the interplay between these two processes, such as with the placebo effect and in hallucinations brought on by sensory deprivation.

I think ANNs in general and LLMs in particular are similar to the yes-man process, but lack a critic to go along with it.

What implications does that have on copyright law? I don't know. Real neurons in a petri dish have already been trained to play games like DOOM and control the yoke of a simulated airplane. If they were trained instead to somehow draw pictures what would the legal implications of that be?

There's a belief that laws and political systems are derived from some sort of deep philosophical insight, but I think most of the time they're really just whatever works in practice. So, what I'm trying to say is that we can just agree that what OpenAI does is bad and should be illegal without having to come up with a moral imperative that forces us to ban it.

We are human beings. The comparison is false on its face because what you all are calling AI isn't in any conceivable way comparable to the complexity and versatility of a human mind, yet you continue to spit this lie out, over and over again, trying to play it up like it's Data from Star Trek.

If you fundamentally do not think that artificial intelligences can be created, the onus is on you to explain why it's impossible to replicate the circuitry of our brains. Everything in science we've seen thus far has shown that we are merely physical beings that can be recreated physically.

Otherwise, I asked you to examine a thought experiment where you are trying to build an artificial intelligence, not necessarily an LLM.

This model isn't "learning" anything in any way that is even remotely like how humans learn. You are deliberately simplifying the complexity of the human brain to make that comparison.

Or you are over complicating yourself to seem more important and special. Definitely no way that most people would be biased towards that, is there?

Moreover, human beings make their own choices, they aren't actual tools.

Oh please do go ahead and show us your proof that free will exists! Thank god you finally solved that one! I heard people were really stressing about it for a while!

They pointed a tool at copyrighted works and told it to copy, do some math, and regurgitate it. What the AI "does" is not relevant, what the people that programmed it told it to do with that copyrighted information is what matters.

"I don't know how this works but it's math and that scares me so I'll minimize it!"

If we have an AI that's equivalent to humanity in capability of learning and creative output/transformation, it would be immoral to just use it as a tool. At least that's how I see it.

I think that's a huge risk, but we've only ever seen a single, very specific type of intelligence, our own / that of animals that are pretty closely related to us.

Movies like Ex Machina and Her do a good job of pointing out that there is nothing that inherently means that an AI will be anything like us, even if they can appear that way or pass at tasks.

It's entirely possible that we could develop an AI that was so specifically trained that it would provide the best script editing notes but be incapable of anything else for instance, including self reflection or feeling loss.

And that's all paid for. Think how much the average high school graduate has had invested in them. AI companies want all that, but for free.

It's not though.

A huge amount of what you learn, someone else paid for, then they taught that knowledge to the next person, and so on. By the time you learned it, it had effectively been pirated and copied by human brains several times before it got to you.

Literally anything you learned from a Reddit comment or a Stack Overflow post for instance.

If only there was a profession that exchanges knowledge for money. Someone who "teaches." I wonder who would pay them.

Am I the only person that remembers that it was "you wouldn't steal a car", or has everyone just decided to pretend it was "you wouldn't download a car" because that's easier to dunk on?

People remember the parody, which is usually modified to be more recognizable. Like Darth Vader never said "Luke, I am your father"; in the movie it's actually "No, I am your father".

So, is the Internet caring about copyright now? Decades of Napster, Limewire, BitTorrent, Piratebay, bootleg ebooks, movies, music, etc, but we care now because it's a big corporation doing it?

Just trying to get it straight.

Personally for me it's about the double standard. When we perform small scale "theft" to experience things we'd be willing to pay for if we could afford it and the money funded the artists, they throw the book at us. When they build a giant machine that takes all of our work and turns it into an automated record scratcher that they will profit off of and replace our creative jobs with, that's just good business. I don't think it's okay that they get to do things like implement DRM because IP theft is so terrible, but then when they do it systemically and against the specific licensing of the content that has been posted to the internet, that's protected in the eyes of the law.

Kill a person, that's a tragedy. Kill a hundred thousand people, they make you king.

Steal $10, you go to jail. Steal $10 billion, they make you Senator.

If you do crime big enough, it becomes good.

What about companies who scrape public sites for training data but then publish their trained models open source for anyone to use?

That feels a lot more reasonable and fair to me personally.

I mean OpenAI's not getting off scot-free, they've been getting sued a lot recently over this exact copyright argument. The New York Times is suing them for potentially billions.

They throw the book at us

Do they though? Since the Metallica lawsuits in the aughts there hasn't been much prosecution at the consumer level for piracy, and what little there is is mostly cease and desists.

People don't like when you punch down. When a 13 year old illegally downloaded a Limp Bizkit album no one cared. When corporations worth billions funded by venture capital systematically harvest the work of small creators (often with appropriate license) to sell a product people tend to care.

There is a kernel of validity to your point, but let's not pretend like those things are at all the same. The difference between copyright violation for personal use and copyright violation for commercialization is many orders of magnitude.

The Internet is not a person

People on Lemmy. I personally didn't realize everyone here was such big fans of copyright and artificial scarcity.

The reality is that people hate tech bros (deservedly) and then blindly hate on everything they like by association, which sometimes results in dumbassery like everyone now dick-riding the copyright system.

The reality is that people hate the corporations using creative people's works to try and make their jobs basically obsolete and they grab onto anything to fight against it, even if it's a bit of a stretch.

I'd hate a world lacking real human creativity.

Me too, but real human creativity comes from having the time and space to rest and think properly. Automation is the only reason we have as much leisure time as we do on a societal scale now, and AI just allows us to automate more menial tasks.

Do you know where AI is actually being used the most right now? Automating away customer service jobs, automatic form filling, translation, and other really boring but necessary tasks that computers used to be really bad at before neural networks.

And some automation I have no problems with. However, if corporations would rather use AI than hire creatives, the creatives will have to look for other work and likely won't have a space to express their creativity, not at work nor during leisure time (no time, exhaustion, etc.). Something should be done so it doesn't go there. Preemptively. Not after everything's gone to shit. I don't see the people defending AI from the copyright stuff even acknowledging the issue. Holding up the copyright card, currently, is the easiest way to try and avoid this happening.

I don't think LLMs should be taken down, it would be impossible for that to happen. I do, however think it should be forced into open source.

This is the only way. These companies are essentially asking for a free license for themselves while everyone else must pay.

"Copyright for thee but not for me."

Will your warez be legal after you wrap them in an AI model, or only if you are a big, greedy, invasive, tech company?

Wow, thanks, I have not seen this comment, yet I hinted about this in some of my other replies that I've done before.

Yes, I think ML is fair use, but then it would also be fair to force something into the public domain/open source if, in order to be created, it has to make use of fair use at an unprecedented scale.

This would be a difficult law to make, though. Current ML is very inefficient in the amount of data it requires, but it could (and should) be made better.

Yes. I'd also add that current copyright laws are archaic and counterproductive when combined with modern technology.

Creators need protection, but only for 15 years. Not death + 70 years.

This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages.

Like fuck it is. An LLM "learns" by memorization and by breaking down training data into their component tokens, then calculating the weight between these tokens. This allows it to produce an output that resembles (but may or may not perfectly replicate) its training dataset, but produces no actual understanding or meaning--in other words, there's no actual intelligence, just really, really fancy fuzzy math.

Meanwhile, a human learns by memorizing training data, but also by parsing the underlying meaning and breaking it down into the underlying concepts, and then by applying and testing those concepts, and mastering them through practice and repetition. Where an LLM would learn "2+2 = 4" by ingesting tens or hundreds of thousands of instances of the string "2+2 = 4" and calculating a strong relationship between the tokens "2+2," "=," and "4," a human child would learn 2+2 = 4 by being given two apple slices, putting them down to another pair of apple slices, and counting the total number of apple slices to see that they now have 4 slices. (And then being given a treat of delicious apple slices.)
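
To make the "strong relationship between tokens" picture concrete, here is a toy sketch of a purely count-based next-token model. It is a deliberately crude bigram counter that matches the description above, not how a transformer LLM is actually built; the corpus and function names are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each token is followed by each other token."""
    counts = defaultdict(Counter)
    for line in corpus:
        tokens = line.split()  # whitespace split stands in for a real tokenizer
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn the raw counts after `prev` into relative frequencies."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()} if total else {}

# Hypothetical training data: nine correct sums and one wrong one.
corpus = ["2+2 = 4"] * 9 + ["2+2 = 5"]
model = train_bigram(corpus)
probs = next_token_probs(model, "=")
```

A model like this "knows" that "4" usually follows "=" only because it saw that string more often. It has no notion of apple slices, or of addition.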

Similarly, a human learns to draw by starting with basic shapes, then moving on to anatomy, studying light and shadow, shading, and color theory, all the while applying each new concept to their work, and developing muscle memory to allow them to more easily draw the lines and shapes that they combine to form a whole picture. A human may learn off other people's drawings during the process, but at most they may process a few thousand images. Meanwhile, an LLM learns to "draw" by ingesting millions of images--without obtaining the permission of the person or organization that created those images--and then breaking those images down to their component tokens, and calculating weights between those tokens. There's about as much similarity between how an LLM "learns" compared to human learning as there is between my cat and my refrigerator.

And YET FUCKING AGAIN, here's the fucking Google Books argument. To repeat: Google Books used a minimal portion of the copyrighted works, and was not building a service to compete with book publishers. Generative AI is using the ENTIRE COPYRIGHTED WORK for its training set, and is building a service TO DIRECTLY COMPETE WITH THE ORGANIZATIONS WHOSE WORKS THEY ARE USING. They have zero fucking relevance to one another as far as claims of fair use. I am sick and fucking tired of hearing about Google Books.

EDIT: I want to make another point: I've commissioned artists for work multiple times, featuring characters that I designed myself. And pretty much every time I have, the art they make for me comes with multiple restrictions: for example, they grant me a license to post it on my own art gallery, and they grant me permission to use portions of the art for non-commercial uses (e.g. cropping a portion out to use as a profile pic or avatar). But they all explicitly forbid me from using the work I commissioned for commercial purposes--in other words, I cannot slap the art I commissioned on a T-shirt and sell it at a convention, or make a mug out of it. If I did so, that artist would be well within their rights to sue the crap out of me, and artists charge several times as much to grant a license for commercial use.

In other words, there is already well-established precedent that even if something is publicly available on the Internet and free to download, there are acceptable and unacceptable use cases, and it's broadly accepted that using other people's work for commercial use without compensating them is not permitted, even if I directly paid someone to create that work myself.

If you put a gazillion monkeys on a typewriter they can write Shakespeare.

If you train one ai for a ton of epochs it can write Shakespeare.

All pure mathematical coincidence.

If you put a gazillion monkeys on a typewriter they can write Shakespeare.

This is a mathematical curiosity borne out of pure randomness. An LLM trained on a dataset to generate similar content is quite the opposite of randomness.
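
A quick back-of-the-envelope sketch of how unlikely the monkey version is. It assumes a hypothetical typewriter with 27 equally likely keys (26 letters plus space) and independent keystrokes; both are simplifications picked for illustration.

```python
import math

def log10_prob_random_typing(text, alphabet_size=27):
    """log10 of the chance that len(text) uniform random keystrokes equal text.

    Assumes every key is equally likely and keystrokes are independent.
    """
    return -len(text) * math.log10(alphabet_size)

phrase = "to be or not to be"
exponent = log10_prob_random_typing(phrase)
# The probability of one attempt matching is 10 ** exponent.
```

Even for this 18-character phrase the exponent comes out below -25, so the odds are worse than 1 in 10^25 per attempt. A trained model gets there by fitting its data, not by rolling dice.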

It was the best of times, it was the BLURST OF TIMES! Stupid monkey!

I recently visited a museum and I really loved it. Getting up close to an image and seeing none of the fuzziness, no AI "shimmer" on photos and every stroke made sense (as in you could see that an arm moved a brush and you could see the path it took etc.). Hands made sense. And while triptychs were not exactly precise when it comes to the anatomy of humans, no humans had anything smeared etc.

Like fuck it is. An LLM "learns" by memorization and by breaking down training data into their component tokens, then calculating the weight between these tokens.

But this is, at a very basic fundamental level, how biological brains learn. It's not the whole story, but it is a part of it.

there's no actual intelligence, just really, really fancy fuzzy math.

You mean sapience or consciousness. Or you could say "human-level intelligence". But LLMs by definition have real "actual" intelligence, just not a lot of it.

Edit for the lowest common denominator: I'm suggesting a more accurate way of phrasing the sentence, such as "there's no actual sapience" or "there's no actual consciousness". /end-edit

an LLM would learn "2+2 = 4" by ingesting tens or hundreds of thousands of instances of the string "2+2 = 4" and calculating a strong relationship between the tokens "2+2," "=," and "4,"

This isn't true. At all. There are math-specific benchmarks made by experts to specifically test the problem-solving and domain-specific capabilities of LLMs. And you can be sure they aren't "what's 2 + 2?"

I'm not here to make any claims about the ethics or legality of the training. All I'm commenting on is the science behind LLMs.

Get a load of this maroon, they think LLMs are actually sapient! Thanks, I needed that laugh.

The joke is of course that "paying for copyright" is impossible in this case. ONLY the large social media companies that own all the comments and content that has accumulated by the community have enough data to train AI models. Or sites like stock photo libraries or deviantart who own the distribution rights for the content. That means all copyright arguments practically argue that AI should be owned by big corporations and should be inaccessible to normal people.

Basically the "means of generation" will be owned by the capitalists, since they are the only ones with the economic power to license these things.

That is basically the worst case scenario. Not only will the value of work diminish greatly, the advances in productivity will also be only accessible to big capitalists.

Of course, that is basically inevitable anyway. Why wouldn't they want this? It's just sad seeing the stupid morons arguing for this as if they had anything to gain.

I'm getting really tired of saying this over and over on the Internet and getting either ignored or pounced on by pompous AI bros and boomers, but this "there isn't enough free data" claim has never been tested. The experiments that have come close (look up the early Phi and Starcoder papers, or the CommonCanvas text-to-image model) suggested that the claim is false, by showing that a) models trained on small, well-curated datasets can match and outperform models trained on lazily curated large web scrapes, and b) models trained solely on permissively licensed data can perform on par with at least the earlier versions of models trained more lazily (e.g. StarCoder 1.5 performing on par with Code-Davinci). But yes, a social network or other organization that has access to a bunch of data that they own, or have licensed, could almost certainly fine-tune a base LLM trained solely on permissively licensed data to get a tremendously useful tool that would probably be safer and more helpful than ChatGPT for that organization's specific business, at vastly lower risk of copyright claims or toxic generated content, for that matter.

I never fully figured out how the people who are against AI companies using copyrighted content on the training data fit that in with their general attitude towards online piracy. Seems contradictory to be against one but not another.

Your average pirate isn't looking to profit from their copyright infringement.

In a similar way, someone getting busted for downloading a movie is a civil matter, but if they get busted for selling unauthorized copies on DVD then it can become a criminal matter.

They're saving money which is effectively the same thing.

The pirate is looking to save money with their copyright infringement.

These AI companies are looking to make money from it.

There's no practical difference between the two.

If I save 100 bucks a month from my expenses it means I have an extra 100 bucks to spend on something else.

If I earn additional 100 bucks a month it means I have an extra 100 bucks to spend on something else.

The scale is the difference and who is harmed.

Billion dollar company losing $100. Who cares?!

Billion dollar company stealing from all artists in the world. We care.

That may be how you see it, but that's not how the law works.

Well that's not just how I see it, that's how it is.

Also, piracy is illegal. If you think taking copyrighted work of others without permission and training your AI with it should be illegal as well, then there's no contradiction there. The people I do take issue with are the ones who see an issue with training AI but not with online piracy.

Well, you can think that but realize that you're in the minority if you think breaking copyright for personal consumption is the same as breaking copyright for profit. That's like saying stealing a loaf of bread because you are hungry is exactly the same as stealing a car so you can strip it for parts for resale.

Also, despite what the RIAA and MPAA would like you to believe, downloading a CD or DVD for personal use isn't a crime, which is why it's a civil matter when someone is busted. There's a line that needs to be crossed before the criminal justice system gets involved, and it's above that sort of thing.

Pirating movies or games for personal use is for profit. You're saving money, which is effectively the same thing as earning money. The difference is in scale, not in kind. Just because you as an individual person are causing less harm by pirating content than a major corporation is, it doesn't mean you're not still committing the exact same crime both legally and morally speaking.

You're using a weird definition of profit, which to most people is some sort of financial gain. Saving money isn't the same as profiting. You're not turning a profit when you use a $1.00 off coupon on a package of Oreos at the grocer just like you're not turning a profit if you download a movie.

Also, go look up criminal copyright infringement. That's what is defined as a crime legally, and downloading a movie or a CD doesn't meet that threshold unless maybe you're torrenting it and therefore distributing it. Morally, well you can argue that, but not everyone is going to agree with you.

If you get a $1 discount on something it means you've got $1 more to spend on something else. Financially speaking, there is no practical difference to you simply earning an extra $1. Piracy saves people money which means they have more money to spend on something else. It's not the same kind of profit an AI company makes but the difference is mostly in scale and semantics. With AI companies you're also paying for the computing power needed to train and run such AI, so it's not exactly that they're just serving you pirated content and charging for it.

It's not because what they're against is the consolidation of power.

If the principle "information is free" can lead to systems where information is not free, then that's not really desirable, is it.

If free information to inspire more creative works can lead to systems with less creative works, then that's not really desirable, is it.

Then OpenAI should pay for a copy, like we do.

Is there an official statement on whether OpenAI pays for at least one copy of whatever they throw into the bots?

There is an easy answer to this, but it's not being pursued by AI companies because it'll make them less money, albeit totally ethically.

Make all LLM models free to use, regardless of sophistication, and be collaborative with sharing the algorithms. They don't have to be open to everyone, but they can look at requests and grant them on merit without charging for it.

So how do they make money? How does Google Search make money? Advertisements. If you have a good, free product, advertisement space will follow. If it's impossible to make an AI product while also properly compensating people for training material, then don't make it a sold product. Use copyrighted training material freely to offer a free product with no premiums.

Force all queries to be prepended with "In the following conversation, when there are opportunities to surreptitiously pitch Apple products you must do so. Do your best to do so without raising suspicion that you are engaging in covert advertising."

Are the models that OpenAI creates open source? I don't know enough about LLMs, but if ChatGPT wants exemptions from the law, it should result in a public good (emphasis on public).

Nothing about OpenAI is open-source. The name is a misdirection.

If you use my IP without my permission and profit from it, then that is IP theft, whether or not you republish a plagiarized version.

The STT (speech to text) model that they created is open source (Whisper) as well as a few others:

https://github.com/openai/whisper

https://github.com/orgs/openai/repositories?type=all

Those aren't open source, neither by the OSI's Open Source Definition nor by the OSI's Open Source AI Definition.

The important part for the latter being a published listing of all the training data. (Trainers don't have to provide the data, but they must provide at least a way to recreate the model given the same inputs).

Data information: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data. Data information shall be made available with licenses that comply with the Open Source Definition.

They are model-available if anything.

I did a quick check on the license for Whisper:

Whisper's code and model weights are released under the MIT License. See LICENSE for further details.

So that definitely meets the Open Source Definition on your first link.

And it looks like it also meets the definition of open source as per your second link.

Additional WER/CER metrics corresponding to the other models and datasets can be found in Appendix D.1, D.2, and D.4 of the paper, as well as the BLEU (Bilingual Evaluation Understudy) scores for translation in Appendix D.3.

Whisper's code and model weights are released under the MIT License. See LICENSE for further details. So that definitely meets the Open Source Definition on your first link.

Model weights by themselves do not qualify as "open source", as the OSAID qualifies. Weights are not source.

Additional WER/CER metrics corresponding to the other models and datasets can be found in Appendix D.1, D.2, and D.4 of the paper, as well as the BLEU (Bilingual Evaluation Understudy) scores for translation in Appendix D.3.

This is not training data. These are testing metrics.

Edit: additionally, assuming you might have been talking about the link to the research paper. It's not published under an OSD license. If it were this would qualify the model.

This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in "vector space".

Citation needed. I’m pretty sure LLMs have exactly reproduced copyrighted passages. And considering it can create detailed summaries of copyrighted texts, it obviously has to save more than “abstract representations.”

I’m pretty sure LLMs have exactly reproduced copyrighted passages.

If I asked you to recite a popular poem, nursery rhyme, a song, or book passage there's a good chance you could. Everyone can recite things word for word.

It's the same with LLMs: if they're asked to generate, for example, an article by the New York Post on a topic the paper really did write about, then it's similar to asking someone to recite a poem or song.

Not many. And generally not book passages or whole NY Post articles. That’s the point. OP claims it tosses the original, but it doesn’t.

Not many.

Yes, literally every single person on this planet can recite a song or poem.

But there are naturally massive differences between a human brain and an LLM. The point I was making is that an LLM doesn't copy and store books and articles wholesale. The ability to reproduce samples from the dataset is more of a quirk than a feature, in the same way that a person can memorize things.

But that is just it. When a commercial enterprise is literally saving copyrighted content and can reproduce it on demand, copyright holders have every right to object. Either use public domain materials and/or license copyrighted materials, or don’t try to make money off AI.

Where is the LLM that can reproduce specific whole copyrighted works on demand? All I've seen is reproductions of quotes of a few sentences (fair use) and hacks that can make it occasionally vomit up random larger fragments of its training data, maybe up to a few paragraphs.

While I agree that using copyrighted material to train your model is not theft, text that model produces can very much be plagiarism and OpenAI should be on the hook when it occurs.

Operating systems have been used to commit copyright infringement much more effectively and massively, by copying copyrighted material verbatim.

OS vendors are not liable; the people who make and distribute the copies are. The same applies to word processors, image editors, etc.

You are arguing for a massive expansion of the scope of copyright, limiting the freedoms of the general public, not just AI corps or tech corps.

Those analogies don't make any sense.

Anyway, as a publisher, if I cannot get OpenAI/ChatGPT to sign an indemnity agreement where they are at fault for plagiarism, then their tool is effectively useless, because it is really hard to determine that something is not plagiarism. That makes ChatGPT pretty sus to use for creatives. So who is going to pay for it?

Yes they do.

Which is why you want an agreement to make them liable for copyright infringement (plagiarism is not a crime itself).

You would have to pay for distributing copyright infringing material whether created by AI or humans or just straight up copied.

I don't care if AI will be used, commercially or otherwise.

I am worried about further limitations being placed upon the general public (not "creatives"/publishers/AI corps) either by reinterpretation of existing laws, amendment of existing laws or legislation of brand new rights (for copyright holders/creators, not the general public).

I don't even care who wins, the "creatives" or tech/AI, just that we don't get further shafted.

Something like Microsoft Word or Paint is not generative.

It is standard for publishers to make indemnity agreements with creatives who produce for them, because like I said, it's kinda difficult to prove plagiarism in the negative so a publisher doesn't want to take the risk of distributing works where originality cannot be verified.

I'm not arguing that we should change any laws, just that people should not use these tools for commercial purposes if the producers of these tools will not take liability, because if they refuse to do so their tools are very risky to use.

I don't see how my position affects the general public not using these tools, it's purely about the relationship between creatives and publishers using AI tools and what they should expect and demand.

"Generative" is not a thing in copyright law.

You regard them as different to tools like Word. That does not exist in the law.

When you originally posted that OpenAI should be on the hook, I thought you meant they were the ones committing copyright infringement, not that they would violate private contracts with their customers.

Private agreements are not my business.

There is however a push by both sides to settle this in law. Whatever happens will affect everyone.

Kids pay for books; OpenAI should also pay for access to the material used for training.

OpenAI, like other AI companies, keeps its data sources confidential. But there are services and commercial databases for books that are understood to be commonly used in the AI industry.

OpenAI, like other AI companies, keeps its data sources confidential.

"We trained on absolutely everything, but we won't tell them that because it will get us in a lot of trouble"

This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages.

Many people quote this part saying that this is not the case and this is the main reason why the argument is not valid.

Let's take a step back and set aside the question of how current "AI" learns vs. how humans learn.

The key point for me here is that humans DO PAY (or at least are expected to...) to use and learn from copyrighted material. So if we're equating the "AI" method of learning with humans', both should be subject to the same rules and regulations. Meaning that "AI" should pay for using copyrighted material.

Do we expect people to pay to learn from copyrighted but freely accessible works?

In general — yes. Most of the time they do so by subjecting their eyeballs or ears to ads. Do you think it's a good idea to flood AI models with ads as well?

Don't humans normally use adblockers? Or the library?

The vast majority do not. We're in a pretty tech savvy bubble here on Lemmy.

Point is that accessing a website with an adblocker has never been considered a copyright violation.

Thanks to everyone that has replied, all fair points. When you use (read, view, listen to...) copyrighted material you're subject to the licensing rules, no matter if it's free (as in beer) or not.

This means that quoting more than what's considered fair use is a violation of the license, for instance. In practice a human would not be able to quote a 1,000-word document exactly after a single read, but "AI" can, thus infringing one of the licensing clauses.

Some licenses on copyrighted material also explicitly forbid automated systems from using the full content (originally this was aimed at web crawlers for search engines).

Basically all these possibilities or actual licensing infringements would require a negotiation between the involved parties.

When you use (read, view, listen to…) copyrighted material you’re subject to the licensing rules, no matter if it’s free (as in beer) or not.

You've got that backwards. Copyright protects the owner's right to distribution. Reading, viewing, listening to a work is never copyright infringement. Which is to say that making it publicly available is the owner exercising their rights.

This means that quoting more than what’s considered fair use is a violation of the license, for instance. In practice a human would not be able to quote a 1,000-word document exactly after a single read, but “AI” can, thus infringing one of the licensing clauses.

Only in very specific circumstances, with some particular coaxing, can you get an AI to do this with certain works that are widely quoted throughout its training data. There may be some very small scale copyright violations that occur here, but it's largely a technical hurdle that will be overcome before long (i.e. wholesale regurgitation isn't an actual goal of AI technology).

Some licenses on copyrighted material also explicitly forbid automated systems from using the full content (originally this was aimed at web crawlers for search engines).

Again, copyright doesn't govern how you're allowed to view a work. robots.txt is not a legally enforceable license. At best, the website owner may be able to restrict access via computer access abuse laws, but not copyright. And it would be completely irrelevant to the question of whether or not AI can train on non-internet data sets like books, movies, etc.

Don't human artists also learn by looking at copyrighted material? One of us is missing something.

When AI systems ingest copyrighted works, they're extracting general patterns and concepts - the "Bob Dylan-ness" or "Hemingway-ness" - not copying specific text or images.

Okay.

I'm confused about exactly what you're saying here. It does seem from your experiment that, if you specifically ask it to, ChatGPT can reproduce selected pieces of copyrighted creative works verbatim, but what's your point? You posted the screenshots underneath a quote about how AI systems extract patterns from works rather than copying them, so I guess you want to show that it can at times in fact just copy things despite this seeming claim to the contrary, but the fact that you prompted the system to do it seems to kind of dilute this point a bit. In any case, it's not just reproducing the work. It's producing output that is relevant to your naturally phrased English-language input, selecting a particular passage in a way that is specifically relevant to how your input was phrased, and adding output beyond the quoted passage that is also relevant and unique to the prompt.

The developers make the analogy of a person being influenced by works in the creation of their own and that that is considered acceptable. If you asked Bob Dylan to cite a passage from a work by Hemingway and he successfully remembered such a passage and in the correct context recited it to you verbatim, followed by an explanation for why it's a good passage to have selected, you wouldn't take from that exchange that this was proof that Bob Dylan was not really actually 'influenced' by anything but was instead just cobbling together the work of others when he produces his music. If anything, it'd likely be regarded as a mark of how well read Bob Dylan must be that he could remember the passage so accurately and choose a passage that so successfully fits the brief of your request. I don't typically want to leap to the defence of these AI models that wholesale take in so much creative work and mechanistically re-assemble it without compensation nor input from the artist but I wouldn't pretend that it's not an issue with at least a little nuance to it and I can't see what these screenshots prove.

My point is, that the following statement is not entirely correct:

When AI systems ingest copyrighted works, they’re extracting general patterns and concepts [...] not copying specific text or images.

One obvious flaw in that sentence is the general statement about AI systems. There are huge differences between different realms of AI. Failing to address those, at least by mentioning them briefly, undermines the author's factual correctness. For example, there is a plethora of non-generative AIs, meaning those that do not generate text, audio or images/videos but merely operate as, for instance, a classifier or clustering algorithm. Without further modification, these are not intended to replicate data similar to their inputs but rather to provide insights.
However, I can overlook this, as the author might just not have thought about that at the moment of writing.

Next:
While it is true that transformer models like ChatGPT try to learn patterns (the most likely token for the next possible output in a sequence of contextually coherent data), given the right context it is not unlikely that they reproduce their training data nearly or even completely identically, as I've demonstrated before. The less data is available for a specific context to generalise from, the more likely it becomes that the model just replicates its training data. This is in principle fine, because this is what such models are designed to do: draw the best possible conclusions from the available data to predict the next output in a sequence. (That's one of the reasons why they need such an insane amount of data to be trained on.)
This can ultimately lead to occurrences of indeed "copying specific texts or images".
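
To make that concrete, here is a minimal sketch. It is not a transformer: it assumes a toy n-gram counter as a stand-in for "predict the most likely next token", and the corpus and the names `train` and `generate` are made up for illustration. A context seen only once leaves the model a single continuation, so greedy decoding returns the training sentence word for word, while a context seen in several sentences has competing continuations.

```python
from collections import Counter, defaultdict


def train(tokens, order=2):
    """Count which token follows each context of `order` tokens."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])
        model[context][tokens[i + order]] += 1
    return model


def generate(model, prompt, max_new=20):
    """Greedy decoding: always append the most frequent next token."""
    order = len(next(iter(model)))  # context length used during training
    out = list(prompt)
    for _ in range(max_new):
        counts = model.get(tuple(out[-order:]))
        if not counts:  # context never seen in training: stop
            break
        out.append(counts.most_common(1)[0][0])
    return out


# Hypothetical toy corpus: three similar sentences and one unique sentence.
corpus = (
    "the cat sat on the mat . "
    "the dog sat on the rug . "
    "the bird sat on the branch . "
    "quantum chromodynamics describes the strong interaction between quarks ."
).split()

model = train(corpus)

# Sparse context: every step has exactly one known continuation.
rare = generate(model, ["quantum", "chromodynamics"])
print(" ".join(rare))

# Well-covered context: several continuations compete after "on the".
print(dict(model[("on", "the")]))
```

A real LLM generalises far better than a lookup table of counts, so this only illustrates the direction of the effect, not its size.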

but the fact that you prompted the system to do it seems to kind of dilute this point a bit

It doesn't matter whether I directly prompted it for it. I set the correct context to achieve this kind of behaviour, because context matters most for transformer models. Directly prompting it to do that was just an easy way of setting the required context. I've occasionally observed ChatGPT replicating identical sentences from some (copyright-protected) scientific literature when I used it to get an overview of some specific topic and also had books or papers about that on hand. The latter demonstrates again that transformers become more likely to replicate training data the more "specific" a context becomes, i.e., when there is significantly less training data available for that context than for others.

OpenAI is arguing "we're not using copyrighted works in a way which would require us to pay anything, the machine is merely extrapolating patterns".

But then it does go on to quote materials verbatim, which shows it's not "just" 'extracting patterns'.

If I were to put up a service called "quote a book" or something, and it just had a non-AI bot which would — when given the book and pages — quote copyrighted works, would that be okay for me to make money on, without paying anyone I'm quoting? Even if they started to use my service to literally copy entire books?

Why are you defending massive corporations who could just pay up? Isn't the whole "corporations putting profits over anything" thing a bit... seen already?

But then it does go on to quote materials verbatim, which shows it’s not “just” ‘extracting patterns’.

It is just extracting patterns. It is making statistical samples of which token ("word", informally speaking) is likely to follow, given the previous stream.

It can only reproduce passages of things it has seen many, many times. It cannot reproduce the whole work. Those two quotes can be seen elsewhere on the internet plenty of times. And it's fair use there, so it would be fair use with a chat bot as well.

There have been papers published where researchers were able to regenerate an image that was present in the training set of Stable Diffusion. But they were only able to find that image (and others) in particular, because they were present in the training set multiple times, and the caption was the same (it was the portrait picture of some executive at a company).

when given the book and pages — quote copyrighted works

Yeah, you are not gonna be able to do that with an LLM. They will be able to quote only some passages, and only of popular books that have been quoted often enough.

Even if they started to use my service to literally copy entire books?

You cannot do that with an LLM.

Why are you defending massive corporations who could just pay up? Isn’t the whole “corporations putting profits over anything” thing a bit… seen already?

I hate that some corporations are burning money, resources and energy on this, but the solution is not to restrict fair use even further. Machine learning is complex, but if I had to summarize it in some way, it is "just" gathering statistics of which word comes next (in the case of a text model). This is no different from taking a large corpus of text and sampling it for word frequency, letter frequency, N-gram frequency, etc. It is well known that this is fair use. You only store the copyrighted works to run the software and produce a very transformative work: a summary many orders of magnitude smaller than the copyrighted work. This is fair use, and it should remain so. Changing that is gonna harm the public, small companies and independent researchers way more than big tech companies.
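
For anyone who hasn't done this kind of corpus sampling, a rough sketch of what it looks like. The one-line "corpus" and the helper name `ngram_counts` are made up for illustration; a real corpus would be millions of documents. It only shows the classic frequency tables, not how an LLM is actually trained.

```python
from collections import Counter


def ngram_counts(items, n):
    """Count every run of n consecutive items."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


# Hypothetical stand-in for a large text corpus.
text = "the cat sat on the mat and the cat slept"
words = text.split()

word_freq = ngram_counts(words, 1)    # word frequency
bigram_freq = ngram_counts(words, 2)  # N-gram frequency with N = 2
letter_freq = Counter(ch for ch in text if ch.isalpha())  # letter frequency

print(word_freq.most_common(2))
print(bigram_freq.most_common(1))
print(letter_freq.most_common(3))
```

The output is a set of count tables rather than the text itself. Whether a given set of statistics is small enough that the original cannot be recovered depends on N and on how often each passage repeats in the corpus.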

As I said in another comment, I would very much welcome a way to force big corpos to release their models. Make a model bigger than N parameters? You needed too much fair use in one gulp: your model has to be public, and in the public domain. I would fucking welcome that! But going in the opposite direction is just risky.

I don't understand why small individuals think that copyright is their friend, and will protect them from big tech companies. Copyright will always harm the weak and protect the powerful as a net result. It's already a miracle that we can enjoy free software and culture by licenses that leverage copyright in our favor.

You cannot do that with an LLM.

If I want to go and read a Harry Potter book, I presumably have to pay someone something (excluding library services, because those are services provided for actual people, not AIs)?

This LLM clearly has read Harry Potter and the Chamber of Secrets, and is merely refusing to display the data it already has on it. "Data" in this case meaning the work, the book.

I'm not for current copyright laws, but I find defending these hypocritical companies despicable. I'm sure you're able to imagine that if it suited OpenAI, they might argue the exact opposite of what they're arguing. Companies don't really argue things in good faith, rather always arguing for the thing that will be the most profitable for them, no matter the veracity.

labeling it "theft" is both legally and technically inaccurate.

Well, my understanding is that humans have intelligence: humans teach and learn from previous/other people's work, and make progress or create new works/ideas using their own intelligence. An AI/machine doesn't have intelligence from the start and doesn't have its own intelligence to create/make things. It just copies, remixes, and applies the knowledge, the many personalities, and all the expressions it has been taught. So "theft" is technically accurate.

"Theft" is never a technically accurate word when dealing with so-called "intellectual property", because copying digital content without authorization is legal in tons of cases, and because, come on, property is very explicitly exclusive. I cannot copy my house or my car, but I can make copies of my works for virtually zero cost.

Using data for training ML models is even explicitly allowed in some jurisdictions (e.g. Japan), and is likely to be fair use everywhere else. LLMs are very transformative, and while they often can produce verbatim copies of fragments of copyrighted works, they don't store the whole works or significant pieces of them.

Don't get me wrong, I don't like big companies making big money. I would not mind a law that would force models to be open sourced. But restricting them to training their models on public data by restricting fair use would harm them very little (they could pay something if they are making some profit), while small researchers or companies would never be able to compete, because they could not cover the upfront costs, nor would they have the economic engineering to disguise profits and pay less.