Question related to Lemmy and AI: I want to know if there are solid implementations or plans to combat massive AI crawls and protect Lemmy users' data, such as posts, comments, and even user info.

/0@lemmy.dbzer0.com – 31 points – 4 months ago

I hate to break it to you, but federated services are basically impossible to protect from scraping. The whole idea is openness and federation.

The only reason why places like Twitter and Reddit try to prevent scraping is so they can sell the data for profit.

If you post stuff publicly anywhere it will be scraped. On the fediverse it will be scraped via the open and federated APIs. On proprietary platforms it will be scraped via the proprietary paid APIs.
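To make "scraped via the open APIs" concrete, here is a minimal Python sketch of reading a community's public posts with no login at all. It assumes the instance exposes Lemmy's v3 HTTP API at `/api/v3/post/list` and that each returned item carries the title under `post.name`; the host `lemmy.example`, the helper names, and the User-Agent string are made up for illustration, so check them against the instance you actually care about.

```python
import json
import urllib.parse
import urllib.request


def post_list_url(instance, community, page=1, limit=20):
    """Build the public post-listing URL (assumed Lemmy v3 endpoint)."""
    query = urllib.parse.urlencode(
        {"community_name": community, "sort": "New", "page": page, "limit": limit}
    )
    return f"https://{instance}/api/v3/post/list?{query}"


def fetch_json(url):
    """Fetch a URL and decode the JSON body. No credentials are sent."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-reader/0.1"}  # hypothetical UA string
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def post_titles(payload):
    """Pull post titles out of a listing payload (assumed response shape)."""
    return [item["post"]["name"] for item in payload.get("posts", [])]
```

Something like `post_titles(fetch_json(post_list_url("lemmy.example", "asklemmy")))` would then list recent titles. Nothing here needs an account or an API key, which is the whole point: anything an anonymous browser can see, a script can see too.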

Another question related to your answer: how can I guarantee that the content I create (comments) is available for scraping?

The issue I have with Reddit and the like is that we can't freely access the content, especially past content. I don't want instances to be sold in 10 years or so, compromising access to old content (or putting ads in it). I would like to be able to replicate a rogue instance onto a new, free instance.

it's the wild west right now in the fediverse.

a multitude of products are being created right now. most haven't hit version 1.0 yet. there are no guarantees other than the assurances you get from your community instance/implementation.

the only solid guarantee you will ever get comes from creating your own instance, so you can curate your own content (as well as the content pulled in from the 'verse).

it took reddit close to 20 years to get where it is. let's give the fediverse a little time.

I want to make a distinction between scraping and archiving here.

You don’t need to do anything to ensure your content is “scrapeable”. Just post your content on the fediverse and it is available to scrape. Anyone can do it. That said, unless someone goes out of their way to save what they scrape, as your content ages the only copy will eventually be on the server it originates from. I believe all posts are stored on the instance where the community lives. I believe comments work the same way, the difference being that your own instance also stores a local copy of your comment. I could be wrong there, though.

Archiving is different. Archiving means providing long-term storage for your content. That is harder. If you run your own instance, the comments you put on communities that live on your instance are safe. Anywhere else, you are at the mercy of that instance dying or selling out. You would need a specialized tool to take a “snapshot” or something. Maybe adding the post thread to archive.org could work. It’s messy in any case.
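For what a "snapshot" could look like in practice, here is a rough Python sketch, not a real tool. It does two things: writes whatever JSON you already fetched for a thread to a timestamped local file, and builds the Wayback Machine "Save Page Now" address for the thread's web page. The function names, the file naming scheme, and the record layout are all my own assumptions.

```python
import json
import time
from pathlib import Path


def wayback_save_url(page_url):
    """Address that asks the Wayback Machine to capture page_url when visited."""
    return "https://web.archive.org/save/" + page_url


def write_snapshot(payload, source_url, out_dir):
    """Write already-fetched thread data to a timestamped JSON file.

    The record layout (source, saved_at, data) is a hypothetical format.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved_at = int(time.time())  # Unix timestamp of the snapshot
    record = {"source": source_url, "saved_at": saved_at, "data": payload}
    path = out_dir / f"snapshot-{saved_at}.json"
    path.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```

The local file only protects you as long as you keep it, and the Wayback capture only covers what the page renders at that moment, so long threads with collapsed comments may come out incomplete. Still messy, just a bit less so.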