I've been collating bots worth blocking from a range of sources (listed below) and compiling them into .htaccess rules. The entries can also be applied to non-Apache servers - it's just regex, so it's human-readable. I've made extensive comments to help people decide what to implement and how.
If anyone would like to contribute - be it corrections, additions, or any other assistance - please do. Every time I update my list, there are always some big baddies I need to add in.
All crawled content has an immediate impact: data transfer. But there's often a lot more impact even beyond that.
I currently have this user agent blocked, but you may want to take it out: my only reason for blocking it is that Reddit (an unreliable source) claims LLMs are being trained on the Wayback Machine's archives.
- Legitimate search engines (Google, Bing, Yandex, DuckDuckGo, etc.)
- Any crawler that is unlikely to act without user initiation
- Security-related tools ensuring websites are safe to visit
- Any platform fetching Open Graph or similar metadata when a link is shared on it
- Other legitimate purposes, e.g. tools ensuring proper website function
- QualifiedBot is not blocked, as it is designed to only ever run with the webmaster's consent - it is a paid-for service.
- anything with "spider" or something similar, shorter in the name - it will be blocked by the general rule.
Automated data collection at scale is sometimes helpful, but often not.
- Where data is stored and referenced, there is an additional impact each time it is referenced.
- Where data is stored, referenced, and used by a third party (most often for marketing purposes), this has an unknown additional impact.
- In the case of marketing and email scrapers, it may lead to spam emails.
- Personal data may be stored, processed, exposed to and ultimately used by nefarious third parties.
Many competing SEO companies crawl the same content, hunting for backlinks, looking for content placement opportunities, or benchmarking the competition. This is inherently wasteful if the data does not benefit you.
- Where you are not using this tool, it's pointless.
- Where your competitors are using this tool, it is more likely to disadvantage you.
- It will always use bandwidth.
Storing and processing data for LLMs, and later referencing it to generate content, has impacts on content originators, society, and the environment.
- Collected data is stored and/or processed to train models; beyond the storage and training itself, it enlarges the model, ordinarily increasing the impact each time the model is used.
- Copyright is directly violated in the act of training, regardless of whether that violation is visible - to the human eye or through data analysis - in anything later generated by the model.
- Theft and imitation devalues creativity, makes it harder for creatives to be duly recognised for their originality and talent, and threatens creativity and culture itself.
- Theft and imitation devalues academia, scientists, doctors, programmers, translators, and every other skilled profession out there in a similar way, with analogous impacts.
- Promotion of AI encourages overuse and dependence by design, slowing the development of real skills - all on systems that are free for now but will eventually be exclusively pay-to-play.
- Every use has a negative environmental impact that scales with adoption. A lack of industry transparency means we cannot be sure exactly what it is - but we've seen the data centres and power plants spring up.
Assuming you are on an Apache server (WordPress and many other common setups usually are), adapt my rules to your preferences (anything starting with a # is a comment and can be deleted!) and put them near the top of your .htaccess file.
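For context, the rules follow this general shape. This is a minimal sketch only - the agent names below are placeholders, not the repo's full list; use the actual .htaccess file from the repo.

```apache
# Minimal sketch - these bot names are placeholders, not the full list.
<IfModule mod_rewrite.c>
RewriteEngine On
# Return 403 Forbidden to any request whose User-Agent matches (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|Bytespider) [NC]
RewriteRule .* - [F,L]
</IfModule>
```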
First off, I used my own server logs and a special script to track user agents. This is how I found the suspicious, outdated user agents and many of the bots listed. After I'd been using that for a while, enhancing it with various useful sources, I discovered 8G Firewall. That team - Perishable Press - has been doing very similar things in a very similar way, but with some additional general rules thrown in. So I integrated some of their ideas and a couple of agents I had missed; larger or more commonly hit sites may want to integrate more.
- Just check it out: Perishable Press Ultimate Block list
- Please note: The suggested robots.txt may not process properly, and is, to be honest, pointless, because most bots on it will ignore robots.txt.
- Their list may diverge from mine; I haven't checked recently. I did notice one or two "only when requested" AI crawlers are on there, such as QualifiedBot.
- If you see things missing from mine, let me know. If you see things missing from theirs, they probably would appreciate the hint, too!
- The full 8G Firewall is worth checking out
- I don't use the lot, but it's an excellent basis for great .htaccess security.
- Check out their paid plugins
- I don't use them, but it feels more than fair to link to them given how much they are doing for free.
These can provide a great starting point, though I should emphasise that I tend to check out each bot manually, and to take at least a quick look at bots in other categories too. For example, some SEO tools may also crawl to train LLMs that generate content, while some miscellaneous or unclassified bots turn out to be easy enough to identify as suspicious.
- Cloudflare's policy and links to the directory
- Dark Visitors has a great directory and robots.txt tool
- It has a tool for daily updates to the robots.txt.
- Robots.txt by itself isn't very useful, but it's possible the paid version goes further - I think it uses an API lookup system (not THAT efficient), but I may be wrong.
- It also has some monitoring tools that make it easier for less nerdy types to see what is happening.
- Data Dome is a security company sharing useful intel
- Many more sources can be added here!
A small number of AI companies will observe robots.txt, but many don't. Robots.txt is still necessary, as it is the only way to block Google's AI training without touching your ability to rank in Google Search or appear in AI summaries (yep, appearing in Google's AI summaries doesn't require you to let yourself get scraped to train Gemini). I'm including my robots.txt, but please understand it is not very effective on its own.
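As a concrete example, the Google case is handled with the Google-Extended token in robots.txt, which (as Google documents it) controls use of your content for Gemini training without affecting Googlebot or Search. A minimal sketch - my full robots.txt in the repo covers many more agents:

```txt
# Sketch only - the repo's robots.txt lists many more agents.
# Google-Extended controls use of content for Gemini training;
# it does not affect Googlebot or your Search ranking.
User-agent: Google-Extended
Disallow: /
```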
- Think of robots.txt as saying "Please don't attend this house party", but they can crash it anyway, with no extra effort.
- Think of .htaccess or other rules using user agent IDs as putting a bouncer on the door to check IDs (a sketch of such rules follows this list):
- Fake IDs are broadly available, but:
- Script kiddies often just borrow the same ID as everyone else running the same script, so they are easy to block.
- A lot of them forget to update their ID, or don't bring one to the club at all.
- Anyone from a larger company generally obeys company policy of using the same ID every time.
- Other systems (beyond ID checks) exist, but they use a lot more energy to run, can slow performance, and can hurt the user experience.
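To make the "didn't bring an ID" and "forgot to update the ID" cases concrete, here is a hedged sketch of what those checks can look like. The patterns are illustrative, not my exact rules:

```apache
# Illustrative only - not the repo's exact rules.
<IfModule mod_rewrite.c>
RewriteEngine On
# "Didn't bring an ID": empty (or dash-only) user agent
RewriteCond %{HTTP_USER_AGENT} ^-?$ [OR]
# "Forgot to update the ID": browser versions no real visitor still uses
RewriteCond %{HTTP_USER_AGENT} "MSIE [2-6]\." [NC]
RewriteRule .* - [F,L]
</IfModule>
```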
First, try googling it and see if the results give you an answer. Still not sure? Reach out to me and I can look into it. And let me know if something needs fixing!
Cloudflare's list does not go anything like as far. They also block some verified AI crawlers that only ever run on their customers' sites, such as QualifiedBot, yet they leave a lot of major scrapers and AI crawlers unblocked.
These are fundamentally different. They take (a lot) more energy to run, but will be significantly more effective where that extra layer is required.
Totally different approaches. Those tools generally monitor changing IP addresses, but usually require a database or IP lookup, so they are less efficient by nature. You can definitely use both: these rules are processed earlier in the chain, on the server, and cover more than "just" security. They also spare the database from ever needing to check a request that already breaks one of these rules.
The entire repo, all files, and all information within are shared with no guarantees. The robots.txt and .htaccess block lists are open source, free to use and modify (though please share your smart ideas!), no need to credit. But here's the caveat: Anything in .htaccess or even robots.txt can break your site. Even if it looks fine to you, a subtle mistake may still end up breaking key functionality or stopping some legit users from accessing your content. Use is entirely at your own risk. You MUST test thoroughly prior to and during deployment. I am not responsible for any breakage, downtime, or weird side effects.