I heard from a guy once who was incredulous that I’d want to block the bots that scrape websites to train AI models. Why wouldn’t I want to help train the intelligence of the future? I get that perspective, I do, but shouldn’t it be up to me? The companies scraping my sites are for-profit, and they’re turning that access into paid products. And they aren’t cutting checks — they’re drinking milkshakes. Nor are they typically linking to or crediting sources.
The AI company bots say they’ll respect robots.txt, the file on a site that tells whole-internet roving bots which parts of it they’re allowed to crawl. They might, but given the level of respect shown so far, I wouldn’t bet on it.
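For what it’s worth, the robots.txt side of this is simple enough to sketch. Treat this as illustrative rather than exhaustive: the bot names below are borrowed from the User-Agent list in Ethan’s rule further down, and each AI vendor documents its own token, so check their docs before relying on it.

User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: CCBot
User-agent: Google-Extended
User-agent: anthropic-ai
User-agent: PerplexityBot
Disallow: /

But that still depends on the bots choosing to honor it. I like the idea of instructing your actual web server to block the bots based on their User-Agent string. Ethan Marcotte did just this recently: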
First, I polled a few different sources to build a list of currently-known crawler names. Once I had them, I dropped them into a mod_rewrite rule in my .htaccess file:
<IfModule mod_rewrite.c>
RewriteEngine on
RewriteBase /
# block “AI” bots by User-Agent, matched case-insensitively
RewriteCond %{HTTP_USER_AGENT} (AdsBot-Google|Amazonbot|anthropic-ai|Applebot|AwarioRssBot|AwarioSmartBot|Bytespider|CCBot|ChatGPT|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|DataForSeoBot|Diffbot|FacebookBot|Google-Extended|GPTBot|ImagesiftBot|magpie-crawler|omgili|Omgilibot|peer39_crawler|PerplexityBot|YouBot) [NC]
# answer any matching request with a 403 Forbidden
RewriteRule ^ - [F]
</IfModule>
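A quick way to check that a rule like this is live is to fake a matching User-Agent with curl against your own site (example.com here is a placeholder) and confirm the server answers with a 403:

curl -sI -A "GPTBot" https://example.com/ | head -n 1
# expect a 403 Forbidden status line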