A lawsuit claims Google has been 'secretly stealing everything ever created and shared on the internet by hundreds of millions of Americans' to train its AI

Technology@lemmy.world – 257 points – 1 year ago

A lawsuit claims Google took people's data without their knowledge or consent to train its AI products, including chatbot Bard.

If you own a web site and believe that it is "stealing" for AI bots to read your site's content and learn from it, do you also believe that search engine indexing is "stealing"? Search engine indexing involves the search engine bot downloading all the public content of your site and building a model (the index) from it. That is how it's possible for search engine users to find your site.

If you do believe search engine indexing is "stealing", have you blocked Googlebot, Bingbot, BaiduSpider, DuckDuckBot, YandexBot, etc. in your robots.txt?

"Publishing" means making public.

If you write a book, you own the copyright to the book. But the fact that the text of your book contains a particular word, e.g. the word "mesothelioma", is a public fact. You don't own that fact.

A search engine for book content can read your book, and record the fact that it contains the word "mesothelioma" in its model; and then when someone searches for that word, it can return a link to your book.

Creating the index meant that the search engine internally made a copy of the text of your book. However, serving search results is not a copyright infringement; rather, it is stating the true fact that your book contains that word.
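To make the "index" idea concrete, here is a minimal sketch of an inverted index in Python. It is only an illustration under simple assumptions (whole-word matching, lowercase tokens, made-up book IDs and texts); real search engines are far more elaborate. The point it shows is that the index stores facts of the form "this word appears in that document" and search results are links, not the text itself.

```python
import re
from collections import defaultdict


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the set of document IDs that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        # The indexer has to read the full text to find the words in it.
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index: dict[str, set[str]], word: str) -> list[str]:
    """Return the IDs of documents containing the word, not their text."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical two-book "library" used only for this example.
books = {
    "book-a": "Asbestos exposure can cause mesothelioma.",
    "book-b": "A cookbook about bread.",
}
index = build_index(books)
print(search(index, "mesothelioma"))
```

Searching returns only the identifier of the matching book, which is the "true fact that your book contains that word" described above.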

Similarly, if you write a book about how asbestos causes mesothelioma, that fact is not your property. If someone borrows your book from the library, reads it, and learns that fact, they do not owe you money. Even if they go around telling everyone about mesothelioma, they still do not owe you any money.

If they are an academic, the rules of academic publishing say that they are supposed to cite your work as a source — telling their readers that they learned something from your work. But if they don't, that's still not copyright infringement; it's plagiarism, which is not a crime but rather an offense against academic honor.

Search engines point to your site though. You are getting back something. An LLM won't give a reference. It's something else altogether.

And there is no "robots.txt" to block LLM training scrapers.

Just because you publish something doesn’t imply you forfeit copyright.

Their work isn't being reproduced and sold. Seems like fair use. I hate to say it, but I'm with Google on this. Things would get much worse if these lawsuits succeeded.

No, but it is being used commercially for a profit.

This seems like a situation copyright law never saw coming.

If I read a bunch of copyrighted books, and answer questions based on the knowledge I have acquired from them, I do not owe the authors anything.

TLDR: maybe it’s like a library? Libraries pay for books, even digital copies.

Presumably somebody bought a copy of the book, even if you found it on the coffee table.

This seems more like going through the trash for anything legible, reading billboards and taking free newspapers. It just happens that a lot of the stuff put out at the curb was copyrighted material. In fact, almost every website has © in the footer, so clearly the sentiment is “don’t copy my original content”, especially without credit. But if the AI is not reproducing, in whole or in part, the copyrighted material, then it does seem a bit late to try to claw back value just because someone else found a way to monetize what you put out on the open web. I think that’s what’s going to have to be proven, one way or another.

Maybe another way to look at a LLM is as an enormous library, but instead of borrowing books and periodicals, as a user you are borrowing the pre-digested knowledge directly. Libraries have complex agreements in place with publishers, so that rights holders are compensated. Say what you will about these contracts, but they are a precedent. What is perhaps without precedent is how to handle the rest of the trash this library is indiscriminately gathering up.

And that's fair use. But I'm more on the side that a lot of things should be more fair use and modern content creators have ruined the internet. I would much prefer if the Sarah Silvermans and others realized they can't both try to use the internet for free promotion while also preventing specific people from consuming that content. I'd love it if they all left. I don't need these people as much as they need us.

See my edit. You own the copyright to your work but you do not own ① facts about your work, or ② facts contained in your work. The creators of reference works, for example, cannot assess a royalty fee from people who learn information from those reference works.

And there is no "robots.txt" to block LLM training scrapers.

robots.txt is consulted by all manner of automated tools, not just search-engine indexing.
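As a rough sketch of what "consulting robots.txt" looks like for any automated tool, the Python standard library's urllib.robotparser can evaluate the rules. The robots.txt content and the crawler name ExampleTrainingBot below are hypothetical, made up for this example. Note that robots.txt is a voluntary convention: this only shows how a well-behaved crawler would check it, and nothing technically forces a crawler to do so.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler everywhere,
# and keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed("Googlebot", "https://example.com/post"))
print(is_allowed("ExampleTrainingBot", "https://example.com/post"))
```

With these rules, the search crawler is allowed to fetch the post and the hypothetical training crawler is not, because the same file can carry different rules per user agent.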

There are multiple issues at play here, on both the legal and ethical levels. But you're a whole lot of wrong.

At the legal level, the DMCA gives search engines protections that let them index websites. A for-profit (!) AI does not have those same DMCA benefits (or at least, it hasn't played out in court yet).

On the ethical level, well, I want people to find my website, Google wants traffic, it's a trade. An AI can be used to make content that sure as shit sounds like my website, and if I have enough "content" out there, it can even be asked to emulate my voice. The AI will be used in place of my website instead of as a tool to find my website. This competition for the same click is the basis of several laws being written and coming into effect currently, because Google indexing your website in near real-time and serving news with no click-through is a dick move.

The law is slow to catch up, but it will probably get there.

Not to mention your comparison to a book I wrote... if I created a unique character for said book, let's say Bucky Bouce, then I would own the copyright on that character for a long time, thanks to Disney. A machine learning model being trained on my book would not be able to differentiate between the word "mesothelioma" and the character Bucky Bouce.

If I invent a new way to cure "mesothelioma", and publish that in a scientific paper, I can still file a patent on that cure. I can then own that process for quite some time. If I'm rich enough I can even drag it through the courts to extend that protection. Looking at you Bedaquiline.

I won’t go very deep into this and will make a very simple case to explain the issue. Let’s say I decide to buy testing equipment for a certain type of device. I run these tests and then I document my findings on a personal website. I then get remunerated for my original work through affiliate links or ads, or both, since this is a very common way to monetize a website. Then along comes Google, which takes the content and shows it before the user has the chance to click through to the website. Additionally, Bard doesn’t even reference the original work; it presents it as its own. The consequence is that I will stop testing those devices and the Internet will lose valuable original content. And let’s not forget that Google shamelessly pushed its services above the organic websites, but that’s another whole big can of worms.

What's currently stopping a certain user by the name science_r0x_99 from going to your site and copying your data and posting it on his YouTube channel without giving credit? What's stopping the journalist Johnny Always Busy from copying that data and putting them in an article on the Daily Whatever with a tacky headline and again no credit?

I think you've just described the nature of copyright law, which is effective in some ways and ineffective in others.

That’s where the search engine usually came in and helped the original content get pushed above copycats. What you describe is actually very common, but content creators were rarely bothered by those who plagiarized their content. What Google seems to want to do is to stop behaving like a search engine altogether and start acting like the content thieves. In an ideal world, a new search engine (or several) would just push Google out and take its place. But when you’re a monopoly, that’s not really an option now, is it?

This is the problem Sarah Silverman had and why she is joining in on a lawsuit. It's not just that it trained on her book, it's that if you ask it to do so, it will regurgitate passages from her book verbatim. That is why this is problematic.

That's not really problematic, since if anyone online asks me to quote one of many books, I can copy-paste passages verbatim and it isn't a copyright violation.

Happens all the time in online communities dedicated to book discussion

But what is stopping it from regurgitating the entire book on request?

Have you actually attempted to ask an LLM to do that?

The most basic problem is that it doesn't store information in that way.

Uhh someone hasn't heard of licences

I'm gonna help you with the first paragraph. Google the definition of "consent".

"Consent" means very different things in different contexts. In some contexts, it's entirely irrelevant. An author doesn't have to grant their consent for their book to be indexed and lent out in public libraries, for example. The library buys a copy on the open market, and they can lend it out to as many people as they like ... with no further permission from the author or publisher required.

Well, seems like in this case it's relevant if publishers really don't like it when AI learns on their data.

Looks like a bullshit case. They filed the same case against OpenAI earlier. I'm sure there's more to come. Thing about Google is, they are painfully transparent about anything they do; they have spoken about this and released articles and research papers and yada yada yada for years now. There doesn't seem to be anything that Google can even be charged with in this case.

They're all bullshit cases from Luddites fighting a fruitless war against AI.

They've already lost, they just don't realize it yet.

Indexing a site isn't stealing from it.

Plus you can shut all that down with some simple HTML

It’s not too much of a surprise. Take a look at Google’s TOS. Anything you upload to their platform they can do as they please with it. There is not even an exception made for email.

Right. And this is beyond fucked. I agree that we shouldn’t be surprised. You have to have some kind of advanced law degree and unlimited wealth with no need to ever do anything else but read TOS if you were to ever have control over your own data.

We shouldn’t be allowed to sign away our rights to things like that—or, I should say, we should have the option not to and still be able to use the internet. We have entered into a kind of world where you have to be an engineer to build your own protected version of the internet, or you have to be a lawyer with nothing to do to use the regular one. Or you’re getting screwed.

The problem is, these rights we can click away in less than two seconds are forfeited for eternity. And with the insane leaps in technology and the complete lack of boundaries for corporations in the US, this can and probably will get ugly and way out of hand before we can even gather what’s happening. In my opinion, it already is out of hand. We just haven’t seen the worst of what they can do with this tech.

Or you can just use the internet like billions of people do, and profit from it like those billions of people do.

The Internet is not Google’s platform. The www will be here long after Google is gone, which I hope is very soon.

Lmao good luck with that hope bud.

Lmao or not, Google is still not the Internet. And I hope more people understand this objective truth.

?????