Is there a way to protect data/user contents in Lemmy/Mastodon against now rapidly rising AIs?

xptiger@lemmy.world to Fediverse@lemmy.world – 55 points

Lemmy and Mastodon are public sites, and I guess their structures are open-source (I'm not a programmer/coder). Can they really dodge the ability of AIs to collect/track any data every time they search everywhere on the Internet?

Radical and altogether stupid idea (but a fun thought) is this:

Were lemmy to have a certain percentage of AI content seamlessly incorporated into its corpus of text, it would become useless for training LLMs on (see this paper for more technical details on the effects of training LLMs on their own outputs, a phenomenon called "model collapse").

In effect this would sort of "poison the well", though given that we all drink the water, the hope would be that our tolerance for a mild amount of AI corruption would be higher than an LLM creator's.

This poisoning approach amusingly benefits from being a thing that could be advertised heavily, basically saying "lemmy is useless for training LLMs, don't bother with it".

Personally, I don't think this is a sensible or viable strategy, and I think the well is already poisoned in this regard (there is already a non-negligible amount of LLM-sourced content on lemmy). But yes, a fun approach to consider: trading integrity for privacy.
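To give a feel for the "model collapse" idea mentioned above, here is a toy sketch. It is only an analogue, not the experiment from the linked paper: the "model" is just a mean and standard deviation, and each generation is fitted to samples drawn from the previous generation's fit. The function name `simulate_collapse` and all parameter values are made up for illustration.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=20, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Returns the standard deviation recorded at each generation,
    starting with the original "real data" value of 1.0.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: the "real" distribution
    history = [std]
    for _ in range(generations):
        # Train only on the previous generation's output.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        mean = statistics.fmean(samples)
        std = statistics.stdev(samples)
        history.append(std)
    return history


history = simulate_collapse()
print(f"start std: {history[0]:.3f}, final std: {history[-1]:.3f}")
```

With small samples, the estimation error of each generation gets baked into the next one, so the fitted spread drifts away from the original and typically shrinks over many generations. Mixing fresh "real" samples back in at each step would slow that drift, which is roughly why a corpus with a lot of generated text is less attractive for training.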

Those "@-@ tailed jackrabbits" in your link made me laugh. Emoticons in species names? Why not?

I think that we could minimise the loss of integrity if the data is "contained" in a way that your typical user wouldn't see it but bots would still retrieve it for model training.

And we don't need to restrict ourselves to use LLM-sourced data for that. The model collapse boils down to the amount of garbage piling up over time; if we use plain garbage we can make it even worse, as long as the garbage isn't detected as such.

Yeah as an ecologist that same thing made me giggle. I suppose why not the lesser-spotted 🍆warbler :P

As for exposing it only to bots, that is the frustrating part: unless you make it seamless, it becomes kinda trivial to mitigate. The approach I'd take to mitigate it is to adapt a lemmy client that already does the filtering, or reverse-engineer the deciding element of the app. Similarly, if you use garbage, then it needs to look enough like normal words to be hard to classify as AI generated.

The funny thing is that LLMs are not actually much good at telling whether something is ai generated, you need to train another model to do that, but to train that ai you need good sources of non-corrupt data. Also the whole point of generative AI language models is that they are actively trying to pass that test by design so it becomes an arms race that they can never really win!

Man, what a shitshow generative ai is

No. Lemmy and Mastodon are unrestricted by design. Assume that any post on either service is public knowledge for any company to store and reuse for whatever purpose they see fit. Edits are not guaranteed to make it and deleting comments doesn't even work within Lemmy, let alone outside on the wider Fediverse, so assume data redaction is impossible to execute perfectly.

You can try to block Threads all you want, but anyone, including Meta, can set up a server with a generic domain name, some fake user accounts, follow a couple million people and just hoover up all the posts on the Fediverse. In fact, some sketchy data broker you've never heard of (that knows where you live) is probably already doing that. You can try to set your Mastodon to require approving followers, but the server admins of any approved follower will still receive the message, and they can do with it whatever they want.

There are apps out there that leverage encryption to ensure that nothing becomes readable outside of its designated audience (barring an approved member getting hacked or turning out to be a dick), like Circles, but they don't have mainstream appeal.

If you care about privacy, ownership of your work, or tracking, leave Lemmy and Mastodon, and avoid anything implementing ActivityPub. The protocol was not designed with any of these things in mind.

I agree with the privacy thing. I'm still not gonna support meta in any way, shape or form. If they want to take my data, be my guest, but I'm not waiting for them to push ads down my throat.

They can put a robots.txt file in their site root, which tells robots (including AI scrapers) to ignore that website. However, that only works on robots that follow the rule; it's self-enforced, so it's a crapshoot whether it'll be followed. Otherwise, to be honest, there isn't a lot a public-facing website can do to avoid being scraped. Maybe put up a captcha on every page?
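For anyone curious what that looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `example.social` URL and the `may_fetch` helper are made up for illustration; `GPTBot` and `CCBot` are user-agent names that some AI-related crawlers announce, but an instance admin would need to check which agents actually matter to them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt an instance admin might serve at /robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return what a *well-behaved* crawler would conclude from robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


POST_URL = "https://example.social/post/1"  # placeholder URL
print(may_fetch("GPTBot", POST_URL))       # False: asked to stay out
print(may_fetch("Mozilla/5.0", POST_URL))  # True: everyone else is allowed
```

Note that the check runs on the crawler's side. Nothing on the server stops a scraper that simply never asks, which is exactly the "self-enforced" problem.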

They're a lot more resistant to it than centralized software.

Stuff you post here has some small chance of remaining un-stored-forever. Obviously people can read it and store it, but it's not systematically indexed and processed like on Facebook, Reddit, etc. Bots go around indexing the big instances, and it's fairly likely that they'll hold onto the data. Aside from that it doesn't get "centralized" anywhere. It might not be a bad idea to delete your comments after a week or two if you care about long term privacy. That's not bulletproof, but it's not a bad idea.

Voting, weirdly enough, is basically public. If you're upvoting or downvoting things, more or less anyone on the network who's tech savvy can dig out the information of who voted on what. Subscriptions are also basically public.

"Reading" actions you take on Lemmy sites -- searches or viewing things -- are probably completely private. The only people who can see them are the individual instance operators, and it's legitimately unlikely that they'll ever look at the data, much less hold onto it once the logs get rotated or do anything with it aside from delete it.

So the TL;DR is it's way better here (mostly because the servers are privately operated by people who at worst, don't give a shit what you're doing, and at best would actively want to defend your privacy most likely).

Voting is done through ActivityPub because that's the only reliable way to do it. If you don't even require account names, sending a million downvotes or upvotes to a post becomes trivial. ActivityPub votes are signed with account keys so the amount of spam votes is restricted somewhat.
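To illustrate why votes end up tied to accounts, here is a rough sketch of what a federated vote could look like, assuming votes travel as ActivityStreams `Like` / `Dislike` activities. The `build_vote` helper and the URLs are hypothetical, and the signing step (done with the account's key when the activity is delivered) is left out entirely.

```python
import json


def build_vote(actor_url, object_url, upvote=True):
    """Build an ActivityStreams-style vote (illustrative, unsigned)."""
    return {
        "@context": "https://www.w3.org/ns/activitystreams",
        "type": "Like" if upvote else "Dislike",
        "actor": actor_url,    # the voting account, visible to every receiving server
        "object": object_url,  # the post or comment being voted on
    }


vote = build_vote(
    "https://example.social/u/alice",        # placeholder account URL
    "https://example.social/comment/12345",  # placeholder comment URL
)
print(json.dumps(vote, indent=2))
```

Because the `actor` field has to be there for the receiving server to verify and count the vote, every server that receives the activity also learns who cast it.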

Lemmy 0.19 added a nifty feature to the web UI that allows server admins to see who voted what on comments. Previously, it was possible to extract that data from the database, but now any admin can just click the menu button on a comment and click "votes" for an overview.

This makes a lot of sense for manually verifying things like voting rings or butthurt people who will go through someone's profile and downvote every comment.

As for reading: Lemmy maintains a read state for posts so its "hide read" feature can do its job. It doesn't store this information about comments, and for notifications you'll have to interact with them or manually dismiss them for the read state to get updated in the database.

What are you wanting to protect? I think you should hold on to a bit of anonymity overall, to the point where it doesn't matter a whole lot. The thrill of these platforms is the more anonymous characters willing to share what's on their mind without huge backlash/cancelling, in my opinion.

you can try and hold on to this 'light anonymity' as long as you want, but i feel the longer we all spend in the public fediverse, the easier it will be to piece together our content->id by fingerprinting... should someone decide they really want to do that.

ive kind of taken a more public approach...no, im not blathering my personal info about.. but i try not to say or do anything i wouldnt say or do to that person in public on the street in front of my house.

Just change accounts often, don’t build up a big persona fingerprint.

Going against that last part and diving into what’s really on your brain is what I’m saying sets these platforms apart. Aside from the typical bigotry spewing characters that come with it.

i spose. ive had some great conversations with people in random forums across the internet for 30 years.. the platform seems almost irrelevant in my memory, but posts are sticky. everyone knows those things can be hard to remove. i dont think the fediverse will change that.

another option for a more easily anonymous, ephemeral, say-anything kind of environment.. check IRC. ticks a lot of those boxes

True, IRC probably birthed most of my feelings in this regard. Formed a tight knit community and learned a lot of personal stuff on there as well though.

Is there a way to protect data/user contents in Lemmy/Mastodon against now rapidly rising AIs?

yes:

Don't publish it there. It's that simple.