How will the fediverse respond to AI orgs scraping Lemmy/the fediverse for training data?
The desire to profit from AI companies' hunt for training data (over and above the community that created that data) is a big part of what created the context for the recent migration away from Reddit. How will the fediverse approach this problem?
The way search engine indexing works is that crawlers make a complete copy of all the public content on the site (anything that a non-logged-in user can see) and then use that copy for indexing. Googlebot, BaiduSpider, Bingbot, DuckDuckBot, etc. simply copy the public data from your site onto those companies' own servers.
Once they've done that, they can do anything with that data, without further interaction with your site.
That includes using it for ML/AI training.
You cannot technologically prevent that without becoming invisible to search engine indexing. That means not being public on the web.
Your choice. You can't be both public and not public. You can't be both indexable and not indexable.
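To make that concrete: the closest thing to an opt-out is robots.txt, and it is only a request that a crawler may choose to honor, not something the server enforces. Here is a rough sketch using Python's standard `urllib.robotparser`. The bot name `ExampleAIBot`, the `lemmy.example` URL, and the helper `is_allowed` are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: asks one (made-up) AI crawler to stay out,
# while leaving the site open to every other crawler.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return what a *well-behaved* crawler would conclude from robots.txt.

    Nothing here blocks a request; a crawler that never runs this check
    can still fetch any public page.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://lemmy.example/post/1"  # hypothetical public post
    print("Googlebot allowed:", is_allowed("Googlebot", page))
    print("ExampleAIBot allowed:", is_allowed("ExampleAIBot", page))
```

The check runs on the crawler's side, so it only helps with crawlers that bother to run it. And once a search engine that you do allow has copied the page, robots.txt has no say over what happens to that copy.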
Public federation requires being public, which requires being indexable, which in turn means everything written here can be ingested into training pipelines.
That's simply true. It's not good or bad; it's just true. Your alternative is to not post your words on the public web.