Revealed: The Authors Whose Pirated Books Are Powering Generative AI

Pete Hahnloser@beehaw.org to

Technology@beehaw.org – 96 points – 11 months ago

theatlantic.com


Clearly, "transformative" only applies to the work a human has put into the process. It isn't at all clear that an LLM would pass muster for a fair use defense, but there are court cases in progress that may try to answer that question. Ultimately, I think it's going to come down to whether the training process itself, and the human effort involved in training the model on copyrighted data, is considered transformative enough to be fair use, or whether it doesn't constitute copying at all. As far as I know, none of the big cases are trying the "not a copy" defense, so we'll have to see how this all plays out.

In any event, copyright laws are horrifically behind the times and it's going to take new legislation sooner or later.

My bet is: it's going to depend on a case by case basis.

A large enough neural network can be used to store, and then recover, a 1:1 copy of a work... but a large enough corpus can contain more data than could ever be stored in a neural network of a given size, even if some fragments of the input work could be recovered... so it will depend on how big a recoverable fragment is "big enough" to call it copyright infringement... but then again, reproducing up to a whole work is considered fair use for some purposes... but not in every country.

Copyright laws are not necessarily wrong; just remove the "until author's death plus 70 years" coverage, go back to a more reasonable "4 years since publication", and they make much more sense.

My bet is: it’s going to depend on a case by case basis.

Almost certainly. Getty Images has several exhibits in its suit against Stability AI, the company behind Stable Diffusion, showing the Getty watermark popping up in the model's output, as well as several images that are substantially the same as their sources. Other generative models don't produce anything all that similar to the source material, so we're probably going to wind up with lots of completely different and likely contradictory rulings on the matter before this gets anywhere near being sorted out legally.

Copyright laws are not necessarily wrong; just remove the “until author’s death plus 70 years” coverage, go back to a more reasonable “4 years since publication”, and they make much more sense.

The trouble with that line of thinking is that the laws are under no obligation to make sense. And the people who write and litigate those laws benefit from making them as complicated and irrational as they can get away with.

In this case the Mickey Mouse Curve makes sense, just bad sense. At least the EU didn't make it 95 years, and settled on 70 as well... 🙄

I agree with that. And you're right that it's currently in the hands of the courts. I'm not a copyright expert and I'm sure there are nuances I don't grasp - I didn't know fair use requires specifically human transformation if that is indeed the case. We'll just have to see in the end whose layman's interpretation turns out to be correct. I just enjoy the friendly, respectful collective speculation and knowledge sharing.