
Could Reddit's data be "poisoned" to prevent its use in training AI?

In case you didn't know, training an AI on content generated by another AI causes distortion that degrades the quality of its output. This phenomenon is known as model collapse. It is also very difficult to filter AI text out of human text in a dataset.
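
To give a rough feel for the effect, here is a toy sketch in Python. It is not how any real language model is trained: the "model" is just a normal distribution, and each generation is fitted only to samples drawn from the previous generation. The function name, sample size, and generation count are all made up for illustration.

```python
import random
import statistics

def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Refit a normal distribution to its own samples, generation after generation."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0          # generation 0: the "human" data distribution
    spreads = [sigma]
    for _ in range(generations):
        # The next model only ever sees output of the previous model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        spreads.append(sigma)
    return spreads

spreads = simulate_collapse()
print(f"spread at generation 0: {spreads[0]:.3f}")
print(f"spread at generation {len(spreads) - 1}: {spreads[-1]:.3g}")
```

With small samples the fitted spread tends to shrink over the generations, so the later "models" cover far less variety than the original data did. That loss of variety is the intuition behind the term.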

So if you were to start using AI to generate comments and posts on Reddit, their database would be less useful for training AI and therefore the company wouldn't be able to sell it for that purpose.

60 comments
  • I don't think that you can prevent Reddit data from being used for AI training, but you could reduce its value. Based on that, I'd probably:

    1. Generate low quality text that machines would have a hard time sorting out.
    2. Replace your current Reddit content with said gibberish.

    I'm saying this based on the following:

    • I don't think that Reddit has any sort of complex content versioning system; at most, I think that it keeps your deleted posts/comments.
    • Odds are that the data is filtered before being used for "training", and both user karma + content score play a role in that. As such, it would be pointless to add nonsense content that humans will downvote.
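
    To make the second point concrete, here is a minimal Python sketch of the kind of filter I'm guessing at. Everything in it is an assumption: the Comment fields, the keep_for_training helper, the thresholds, and the sample comments are hypothetical, not anything Reddit or an AI company is known to use.

```python
from dataclasses import dataclass

@dataclass
class Comment:
    body: str
    score: int         # net votes on the comment
    author_karma: int  # total karma of the author

def keep_for_training(comment, min_score=2, min_karma=100):
    # Hypothetical thresholds, chosen only for illustration.
    return comment.score >= min_score and comment.author_karma >= min_karma

# Made-up sample data.
comments = [
    Comment("Detailed answer about router setup", 57, 12000),
    Comment("asdf qwer flurb zzt", -8, 12000),
    Comment("Fluent but meaningless filler text", 3, 4000),
]

kept = [c.body for c in comments if keep_for_training(c)]
print(kept)
```

    Under a filter like this, the obvious gibberish gets dropped once humans downvote it, while fluent filler that nobody bothers to downvote slips through. That's why I'd edit existing content that already has a score instead of posting new nonsense.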

    Funnily enough, AI might be a good way to generate this poisoning data against AI. For example, I asked Gemini "Generate three paragraphs of nonsense text, containing three sentences each.", and here's the output:

    You could tweak the prompt to get something even more nonsensical or even more passable, but you get the idea.
