An effective bot blocking list that can be dragged and dropped into .htaccess. There's robots.txt guidance too, though the gist of that is: don't rely on robots.txt!


Bot block list - what it says on the tin

I've been collating bots worth blocking from various sources (listed below) and compiling them into .htaccess rules. The entries/rules in the list can be applied to non-Apache servers - it's just regex, so it's human-readable. I've made extensive comments to help people decide what to implement and how.
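
For illustration, a rule in this style generally looks like the sketch below. The bot names are illustrative placeholders rather than the repository's actual list.

    # Hedged sketch only - the bot names are placeholders, not the curated list.
    <IfModule mod_rewrite.c>
    RewriteEngine On
    # Case-insensitive regex match against the User-Agent header
    RewriteCond %{HTTP_USER_AGENT} (AhrefsBot|SemrushBot|MJ12bot) [NC]
    # Refuse the request with 403 Forbidden and stop processing further rules
    RewriteRule .* - [F,L]
    </IfModule>

On non-Apache servers the same regex can be reused; nginx, for example, can test $http_user_agent against it and return 403.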

Collaboration welcomed!

If anyone would like to contribute - be it corrections, additions, or any other assistance - please do. Every time I update my list, there are always some big baddies I need to add in.

Key principles

Why block? What to block?

All crawled content has an immediate impact: data transfer. But the impact often goes well beyond that.

Controversial choice: archive.org

I currently have this user agent blocked - but consider taking it out. The reason for blocking it is that Reddit threads (an unreliable source) claim LLMs are training on the Wayback Machine's archives.

Crawlers (user agents, in this context) I'd never block

  • Legitimate search engines (Google, Bing, Yandex, DuckDuckGo, etc.)
  • Any crawler that is unlikely to act without user initiation
  • Security-related tools ensuring websites are safe to visit
  • Any platform fetching Open Graph or similar metadata when a link is shared there
  • Other legitimate purposes, e.g. tools ensuring proper website function

Examples of bots that others may flag but which I allow or otherwise do not list

  • QualifiedBot is not blocked, as it is designed only ever to be used with the webmaster's consent - it is a paid-for service.
  • Anything with "spider" or a similar, shorter token in its name is not listed individually - it will be caught by the general rule (see the sketch just after this list).
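
As a rough idea of what such a catch-all looks like, here is a hypothetical sketch - the actual pattern in the list may differ and is commented there:

    # Hypothetical catch-all sketch - check the list itself for the real pattern.
    # Any User-Agent containing "spider" (case-insensitive) is refused with a 403.
    # Note: a token this broad also matches agents like Baiduspider.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} spider [NC]
    RewriteRule .* - [F,L]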

Data collection and processing

Automated data collection at scale is sometimes helpful, but often not.

  • Where data is stored and referenced, there is an additional impact each time it is referenced.
  • Where data is stored, referenced, and used by a third party (most often for marketing purposes), this has an unknown additional impact.
    • In the case of marketing and email scrapers, it may lead to spam emails.
    • Personal data may be stored, processed, exposed to and ultimately used by nefarious third parties.

Unused SEO tools

Many competing SEO companies crawl the same content, hunting for backlinks, looking for content placement opportunities, or benchmarking the competition. This is inherently wasteful if the data does not benefit you.

  • Where you are not using this tool, it's pointless.
  • Where your competitors are using this tool, it is more likely to disadvantage you.
  • It will always use bandwidth.

Training large language models (LLMs) and content generation (generative AI)

Data storage and processing for LLMs, and the later referencing of that data to generate content, have impacts on content originators, society, and the environment.

  • Data collected is stored and/or processed to train models, meaning that beyond the storage or training process, it enlarges the model itself - ordinarily increasing the impact each time the model is used.
  • Copyright is directly violated in the act of training, regardless of whether the violation is visible to the human eye or detectable through data analysis of anything later generated by that model.
  • Theft and imitation devalues creativity, makes it harder for creatives to be duly recognised for their originality and talent, and threatens creativity and culture itself.
  • Theft and imitation devalues academia, scientists, doctors, programmers, translators, and every other skilled profession out there in a similar way, with analogous impacts.
  • Promotion of AI encourages overuse and, by design, dependence - slowing the development of real skills and tying people to currently free systems that will eventually be exclusively pay-to-play.
  • Every use has a negative environmental impact that scales with adoption. A lack of industry transparency means we cannot be sure how big it is, but we've seen the data centres and power plants spring up.

How to block?

Assuming you are on an Apache server (WordPress and many other setups usually are), adapt my rules to your preferences (anything starting with a # is a comment and can be deleted!) and put them in your .htaccess file, near the top. A minimal placement sketch follows.
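
As a placement sketch only (the condition shown is a placeholder for the full list, and the WordPress block is just there to show the relative position):

    # --- Bot blocking rules: keep near the top of .htaccess ---
    <IfModule mod_rewrite.c>
    RewriteEngine On
    # Placeholder condition - paste the rules from this repository here
    RewriteCond %{HTTP_USER_AGENT} (ExampleBadBot) [NC]
    RewriteRule .* - [F,L]
    </IfModule>

    # --- Existing rules (e.g. the standard WordPress block) stay below ---
    # BEGIN WordPress
    # ...
    # END WordPress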

Good sources used to create and update this

First off, I used my own server logs and a special script to track user agents. This is how I found the suspicious, outdated user agents and many of the bots listed. After I'd been using that for a while, enhancing it with various useful sources, I discovered 8G Firewall. That team - Perishable Press - has been doing very similar things in a very similar way, but with some additional general rules thrown in, so...

Special mention: Perishable Press and 8G Firewall

I integrated some of the ideas and a couple of agents I had missed, but larger or more commonly hit sites may want to integrate more.

  • Just check it out: Perishable Press Ultimate Block list
    • Please note: The suggested robots.txt may not process properly, and is, to be honest, pointless, because most bots on it will ignore robots.txt.
    • Their list may diverge from mine - I haven't checked recently. I did notice one or two "only when requested" AI crawlers on there, such as QualifiedBot.
    • If you see things missing from mine, let me know. If you see things missing from theirs, they probably would appreciate the hint, too!
  • The full 8G firewall is worth checking out
    • I don't use the lot, but it's an excellent basis for great .htaccess security.
  • Check out their paid plugins
    • I don't use them, but it feels more than fair to link to them given how much they are doing for free.

Bot lists

These can provide a great starting point, though I should emphasise that I tend to check out each bot manually, and try to take at least a quick look at bots in other categories too. For example, some SEO tools may also crawl to train LLMs for content generation, while some miscellaneous or unclassified bots are not hard to identify as sufficiently suspicious.

What about robots.txt?

A small number of AI companies will observe robots.txt, but many don't. Robots.txt is still necessary, as it is the only way to block Google's AI training without touching your ability to rank in Google Search or appear in its AI summaries (yep, appearing in Google's AI summaries doesn't require you to let yourself get scraped to train Gemini). I'm including my robots.txt, but please understand it is not very effective.
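
For example, Google's documented control here is the Google-Extended robots.txt token; a minimal entry opting out of Gemini training site-wide, while leaving normal Search crawling untouched, looks like this:

    # Opt out of Google's AI (Gemini) training via the Google-Extended token.
    # Googlebot itself is unaffected, so Search indexing continues as normal.
    User-agent: Google-Extended
    Disallow: /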

  • Think of robots.txt as saying "Please don't attend this house party", but they can crash it anyway, with no extra effort.
  • Think of .htaccess or other rules using user agent IDs as putting a bouncer on the door to check IDs:
    • Fake IDs are broadly available, but:
      • Script kiddies often just borrow the same well-known fake ID, so they are easy to block.
      • A lot of people forget to update their ID, or don't bring one to the club at all.
      • Anyone from a larger company generally obeys a company policy of using the same ID every time. (Other systems exist, but they use a lot more energy to run, can slow performance, and hurt user experience.)

Not sure why I blocked this particular text string or bot?

First, try googling it and see if the results give you an answer. Still not sure? Reach out to me and I can look into it. Let me know if something needs fixing!

Differences to Cloudflare or other tools

The block list

Cloudflare's list does not go anywhere near as far. It also blocks some verified AI crawlers that only run on customer sites, such as QualifiedBot, while leaving a lot of major scrapers and AI crawlers unblocked.

The bot challenge tools, e.g. Cloudflare or Anubis

These are fundamentally different. They take (a lot) more energy to run, but will be significantly more effective where that extra layer is required.

The broader IP blocking tools (firewalls), e.g. Cloudflare, Wordfence, or similar

Totally different approaches. These tools generally track changing IP addresses, but usually require a database or an IP lookup - so they are less efficient by nature. You can definitely use both: these rules are processed higher up in the chain, on the server, and cover more than "just" security. They also save the database from needing to kick in at all for a request that already breaks one of these rules.

DISCLAIMER

The entire repo, all files, and all information within are shared with no guarantees. The robots.txt and .htaccess block lists are open source, free to use and modify (though please share your smart ideas!), no need to credit. But here's the caveat: Anything in .htaccess or even robots.txt can break your site. Even if it looks fine to you, a subtle mistake may still end up breaking key functionality or stopping some legit users from accessing your content. Use is entirely at your own risk. You MUST test thoroughly prior to and during deployment. I am not responsible for any breakage, downtime, or weird side effects.
