
This huge list of known bad bots is useful (GitHub)

The list is huge, with over 1,800 "known" bots, and I think adding it is a good thing - I mean - why not?

However, my main question is: can so many .htaccess rules slow the site down?
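For context, the rules in lists like this are typically user-agent matches along these lines (a simplified sketch, not taken from the actual list; the bot names are placeholders):

```apache
# Illustrative only: block requests whose User-Agent matches
# any of the listed patterns, case-insensitively.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilScraper|SpamCrawler) [NC]
RewriteRule .* - [F,L]
```

The real list chains hundreds of such conditions, which is what raises the performance question.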

Henry
  • One reason why not: that list doesn't seem very well vetted. I wrote a link checker bot for the Open Directory Project called "TulipChain." It is low volume: it only requests URLs that are in the open directory, and only requests them when run manually. Back when the ODP was active, blocking the bot could have gotten your listing removed from the open directory and hurt your SEO. Yet that bot is on the list as a "bad_bot". >:( – Stephen Ostermiller Jul 18 '22 at 17:11
  • What makes these "bad bots", apart from having less market share than Google? Also, several of these look like download managers, not bots. – CodesInChaos Jul 19 '22 at 07:57
  • The list also seems to include the default user-agent prefixes of several common HTTP client libraries (such as LWP and python-urllib), so it will block any bot written using those libraries unless the library is configured to lie about what it is, and thus forces authors of even "good" bots to use misleading user-agent headers. Which, to be fair, is kind of a lost battle already — over 99% of all user-agent strings begin with "Mozilla/5.0" just for this reason — but it's still not something to encourage. – Ilmari Karonen Jul 19 '22 at 08:58
  • That list is also badly implemented. There are a lot of redundant rules and the regex used is inefficient. (It's also using Apache 2.2 syntax, not Apache 2.4 - so implies it is quite old.) – MrWhite Jul 26 '22 at 15:46
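To illustrate the Apache 2.2 vs. 2.4 point in the comment above, the access-control syntax changed between those versions (a minimal sketch; the directives are standard Apache, but "BadBot" is a placeholder):

```apache
# Apache 2.2 style (what the list appears to use):
SetEnvIfNoCase User-Agent "BadBot" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot

# Apache 2.4 equivalent using the newer Require directives:
SetEnvIfNoCase User-Agent "BadBot" bad_bot
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>
```

Apache 2.4 still accepts the old syntax via mod_access_compat, but its continued use suggests the list has not been maintained for current Apache versions.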

1 Answer


It may have some impact, but it is usually very low (if properly configured). Here's a Stack Overflow answer from a user who claimed that, in their experiments, a 1,500-line file added 10-12 ms per request.

Some things to consider:

  • Timings will differ depending on your server's resources (CPU speed, memory, load). You should test to be sure.
  • It's best not to put the rules in .htaccess but directly in the httpd.conf file. This is the approach Apache recommends, as it performs better: httpd.conf is read once at startup, while .htaccess files are re-read on every request.
  • If your site sits behind a third-party firewall (like Cloudflare), you can configure the rules there instead. Their extremely fast edge servers filter requests before they ever reach your server, and you manage the rules from their web interface.
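As a minimal sketch of the second point, the same blocking rules can live in the main server config instead of .htaccess (the directory path and bot names here are placeholders):

```apache
# In httpd.conf (or a <VirtualHost> block), rather than .htaccess:
# this is parsed once at startup instead of on every request.
<Directory "/var/www/html">
    # Stop Apache scanning for .htaccess files entirely.
    AllowOverride None

    SetEnvIfNoCase User-Agent "BadBot|EvilScraper" bad_bot
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>
</Directory>
```

Setting AllowOverride None is itself a performance win, since Apache no longer has to check every directory in the path for an .htaccess file on each request.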
Mike Ciffone