I heard from a guy once who was incredulous that I’d want to block the bots that scrape websites to train AI models. Why wouldn’t I want to help train the intelligence of the future? I get that perspective, I do, but shouldn’t it be up to me? The companies scraping my sites are for-profit, turning that access into paid products. And they aren’t cutting checks — they’re drinking milkshakes. Nor are they typically linking to or crediting sources.
The AI company bots say they’ll respect a robots.txt directive, the file on a site that tells whole-internet roving bots what they are and aren’t permitted to crawl. They might, but given the level of respect shown so far, I wouldn’t bet on it. I like the idea of instructing your actual web server to block the bots based on their User-Agent string. Ethan Marcotte did just this recently:
First, I polled a few different sources to build a list of currently-known crawler names. Once I had them, I dropped them into a mod_rewrite rule in my .htaccess file:
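Ethan’s actual list isn’t reproduced here, but a rule of that shape looks roughly like this. The crawler names below (GPTBot, CCBot, ClaudeBot) are illustrative stand-ins, not his list:

```apache
# Sketch of the approach, not Ethan's actual rule; the crawler
# names are illustrative stand-ins for a fuller, current list.
RewriteEngine On

# Match any of the listed strings anywhere in the User-Agent
# header, case-insensitively ([NC]).
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ClaudeBot) [NC]

# Answer with 403 Forbidden ([F]) and stop processing rules ([L]).
RewriteRule .* - [F,L]
```

The [F] flag sends a 403 Forbidden response to any matching request. A well-behaved bot that honors robots.txt would never need to hit this rule; the point of doing it server-side is that it doesn’t depend on the bot’s good manners.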