
Is there a simple way to severely impede web scraping and LLM data collection of my website?

I am working on a simple static website that gives visitors basic information about myself and the work I do. I want to use it to introduce myself to potential clients, collaborators, etc., rather than rely solely on LinkedIn as my visiting card.

This may sound rather oxymoronic given that I am literally going to be placing (some relevant) details about myself and my work on the internet, but I want to limit access to the website by bots, web scrapers, and content collection for LLMs.

Is this a realistic expectation?

Also, any suggestions for privacy-respecting yet inexpensive domains that I can purchase in Europe would be a great help.

39 comments
  • No, not really, as the best way would be to make it totally private.

    Edit: I see you edited the title. You might be able to slow down LLM training. However, your content is such a small fraction of the whole that I doubt it would matter.

    The simplest way might be to add an artificial delay to the page load. You could create a simple loading page that lasts just long enough to cause bots to move on. However, if this method works, it will also completely break search indexing.

  • Look in your access logs, e.g. /var/log/nginx/access.log, for user agents that indicate things like chatgptbot, etc. Then add if ($http_user_agent ~* "useragent1|useragent2|... useragents") { return 403; } to the server block of your website's config file in /etc/nginx/sites-enabled/. You can also add a robots.txt that forbids scraping. ChatGPT generally checks and respects that... for now. This paired with some of the stuff above should work.
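
The user-agent blocking and robots.txt ideas in the last comment could be sketched roughly like this in Python. The crawler names in `AI_CRAWLERS` are only illustrative assumptions (check your own access logs for what actually visits), and the helper names `build_robots_txt`, `is_allowed`, and `nginx_rule` are made up for this example. It generates a robots.txt, checks it with the standard library's `urllib.robotparser`, and prints an nginx rule of the same shape as the one quoted above. Keep in mind that robots.txt is advisory: it only stops crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Assumption: example crawler tokens only; replace with what your logs show.
AI_CRAWLERS = ["GPTBot", "CCBot", "ClaudeBot", "Google-Extended"]


def build_robots_txt(agents):
    """Return robots.txt text that disallows the given agents and allows all others."""
    blocks = [f"User-agent: {agent}\nDisallow: /\n" for agent in agents]
    blocks.append("User-agent: *\nAllow: /\n")
    # Blank lines between blocks separate the per-agent rule groups.
    return "\n".join(blocks)


def is_allowed(robots_txt, agent, path="/"):
    """Check whether a well-behaved crawler with this agent name may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


def nginx_rule(agents):
    """Build an nginx 'if' rule returning 403 for matching user agents (case-insensitive)."""
    pattern = "|".join(agents)
    return f'if ($http_user_agent ~* "{pattern}") {{ return 403; }}'


if __name__ == "__main__":
    robots = build_robots_txt(AI_CRAWLERS)
    print(robots)
    print(nginx_rule(AI_CRAWLERS))
    print("GPTBot allowed:", is_allowed(robots, "GPTBot"))
    print("Regular browser allowed:", is_allowed(robots, "Mozilla/5.0"))
```

The generated robots.txt would go in the site root, and the printed rule in the server block as described above. The nginx rule is the stricter of the two, since it refuses the request regardless of whether the bot reads robots.txt, but it only works as long as the bot sends an honest user-agent string.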
