{"id":1810,"date":"2024-04-23T08:29:32","date_gmt":"2024-04-23T14:29:32","guid":{"rendered":"https:\/\/frontendmasters.com\/blog\/?p=1810"},"modified":"2024-04-23T08:29:33","modified_gmt":"2024-04-23T14:29:33","slug":"blocking-ai-bots","status":"publish","type":"post","link":"https:\/\/frontendmasters.com\/blog\/blocking-ai-bots\/","title":{"rendered":"Blocking AI Bots"},"content":{"rendered":"\n<p>I heard from a guy once who was incredulous <a href=\"https:\/\/chriscoyier.net\/2023\/09\/19\/blocking-ai-scraper-bots\/\">I&#8217;d want<\/a> to block the bots that scrape websites to train AI models. Why <em>wouldn&#8217;t<\/em> I want to help train the intelligence of the future? I get that perspective, I do, but shouldn&#8217;t it be up to me? The companies that are scraping my sites are for-profit and turning access to them into paid products. And they aren&#8217;t cutting checks \u2014 they&#8217;re drinking milkshakes. Nor are they typically linking to or crediting sources.<\/p>\n\n\n\n<p>The AI company bots <em>say<\/em> they&#8217;ll respect a <code>robots.txt<\/code> directive, a file that tells whole-internet roving bots what they&#8217;re permitted to access on a particular site. They might, but given the level of respect shown so far, I wouldn&#8217;t bet on it. I like the idea of instructing <em>your actual web server<\/em> to block the bots based on their User-Agent string. <a href=\"https:\/\/ethanmarcotte.com\/wrote\/blockin-bots\/\">Ethan Marcotte did just this recently<\/a>: <\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>First, I polled a&nbsp;<a href=\"https:\/\/darkvisitors.com\/\">few<\/a>&nbsp;<a href=\"https:\/\/neil-clarke.com\/block-the-bots-that-feed-ai-models-by-scraping-your-website\/\">different<\/a>&nbsp;<a href=\"https:\/\/coryd.dev\/posts\/2024\/go-ahead-and-block-ai-web-crawlers\/\">sources<\/a>&nbsp;to build a list of currently-known crawler names. 
Once I had them, I dropped them into&nbsp;<a href=\"https:\/\/developer.mozilla.org\/en-US\/docs\/Learn\/Server-side\/Apache_Configuration_htaccess#mod_rewrite_and_the_rewriteengine_directives\">a&nbsp;<code>mod_rewrite<\/code>&nbsp;rule in my&nbsp;<code>.htaccess<\/code>&nbsp;file<\/a>:<\/p>\n<\/blockquote>\n\n\n\n<!--more-->\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-1\" data-shcb-language-name=\"HTML, XML\" data-shcb-language-slug=\"xml\"><span><code class=\"hljs language-xml\"><span class=\"hljs-tag\">&lt;<span class=\"hljs-name\">IfModule<\/span> <span class=\"hljs-attr\">mod_rewrite.c<\/span>&gt;<\/span>\nRewriteEngine on\nRewriteBase \/\n\n# block \u201cAI\u201d bots\nRewriteCond %{HTTP_USER_AGENT} (AdsBot-Google|Amazonbot|anthropic-ai|Applebot|AwarioRssBot|AwarioSmartBot|Bytespider|CCBot|ChatGPT|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|DataForSeoBot|Diffbot|FacebookBot|Google-Extended|GPTBot|ImagesiftBot|magpie-crawler|omgili|Omgilibot|peer39_crawler|PerplexityBot|YouBot) &#91;NC]\nRewriteRule ^ - &#91;F]\n<span class=\"hljs-tag\">&lt;\/<span class=\"hljs-name\">IfModule<\/span>&gt;<\/span><\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-1\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">HTML, XML<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">xml<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>","protected":false},"excerpt":{"rendered":"<p>I heard from a guy once who was incredulous I&#8217;d want to block the bots that scrape websites to train AI models. Why wouldn&#8217;t I want to help train the intelligence of the future? I get that perspective, I do, but shouldn&#8217;t it be up to me? 
The companies that are scraping my sites are [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1819,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"sig_custom_text":"","sig_image_type":"featured-image","sig_custom_image":0,"sig_is_disabled":false,"inline_featured_image":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[29],"tags":[164,104,163],"class_list":["post-1810","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-the-beat","tag-htaccess","tag-ai","tag-robots-txt"],"acf":[],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/frontendmasters.com\/blog\/wp-content\/uploads\/2024\/04\/robot-thumb.jpg?fit=1000%2C500&ssl=1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/frontendmasters.com\/blog\/wp-json\/wp\/v2\/posts\/1810","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/frontendmasters.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/frontendmasters.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/frontendmasters.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/frontendmasters.com\/blog\/wp-json\/wp\/v2\/comments?post=1810"}],"version-history":[{"count":4,"href":"https:\/\/frontendmasters.com\/blog\/wp-json\/wp\/v2\/posts\/1810\/revisions"}],"predecessor-version":[{"id":1820,"href":"https:\/\/frontendmasters.com\/blog\/wp-json\/wp\/v2\/posts\/1810\/revisions\/1820"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/frontendmasters.com\/blog\/wp-json\/wp\/v2\/media\/1819"}],"wp:attachment":[{"href":"https:\/\/frontendmasters.com\/blog\/wp-json\/wp\/v2\/media?parent=1810"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/frontendmasters.com\/blog\/wp-json\/wp\/v2\/categories?post=1810"},{"taxonomy":"post_tag","embeddable":true,"href":
"https:\/\/frontendmasters.com\/blog\/wp-json\/wp\/v2\/tags?post=1810"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}