
Issue: Default plesk-apache-badbot fail2ban doesn't work

John41

New Pleskian
Server operating system version
Debian 11
Plesk version and microupdate number
v18.0.59
Hello,

I think I have a configuration problem with the Fail2ban apache-badbot filter, because I am unfortunately getting attacks of this type and Fail2ban does not ban any IP.

The default setting is:
[Definition]
badbotscustom = EmailCollector|WebEMailExtrac|TrackBack/1\.02|sogou music spider|(?:Mozilla/\d+\.\d+ )?Jorgee
badbots = Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|bwh3_user_agent|China Local Browse 2\.6|ContactBot/0\.2|ContentSmartz|DataCha0s/2\.0|DBrowse 1\.4b|DBrowse 1\.4d|Demo Bot DOT 16b|Demo Bot Z 16b|DSurf15a 01|DSurf15a 71|DSurf15a 81|DSurf15a VA|EBrowse 1\.4b|Educate Search VxB|EmailSiphon|EmailSpider|EmailWolf 1\.00|ESurf15a 15|ExtractorPro|Franklin Locator 1\.8|FSurf15a 01|Full Web Bot 0416B|Full Web Bot 0516B|Full Web Bot 2816B|Guestbook Auto Submitter|Industry Program 1\.0\.x|ISC Systems iRc Search 2\.1|IUPUI Research Bot v 1\.9a|LARBIN-EXPERIMENTAL \(efp@gmx\.net\)|LetsCrawl\.com/1\.0 \+http\://letscrawl\.com/|Lincoln State Web Browser|LMQueueBot/0\.2|LWP\:\:Simple/5\.803|Mac Finder 1\.0\.xx|MFC Foundation Class Library 4\.0|Microsoft URL Control - 6\.00\.8xxx|Missauga Locate 1\.0\.0|Missigua Locator 1\.9|Missouri College Browse|Mizzu Labs 2\.2|Mo College 1\.9|MVAClient|Mozilla/2\.0 \(compatible; NEWT ActiveX; Win32\)|Mozilla/3\.0 \(compatible; Indy Library\)|Mozilla/3\.0 \(compatible; scan4mail \(advanced version\) http\://www\.peterspages\.net/?scan4mail\)|Mozilla/4\.0 \(compatible; Advanced Email Extractor v2\.xx\)|Mozilla/4\.0 \(compatible; Iplexx Spider/1\.0 http\://www\.iplexx\.at\)|Mozilla/4\.0 \(compatible; MSIE 5\.0; Windows NT; DigExt; DTS Agent|Mozilla/4\.0 efp@gmx\.net|Mozilla/5\.0 \(Version\: xxxx Type\:xx\)|NameOfAgent \(CMS Spider\)|NASA Search 1\.0|Nsauditor/1\.x|PBrowse 1\.4b|PEval 1\.4b|Poirot|Port Huron Labs|Production Bot 0116B|Production Bot 2016B|Production Bot DOT 3016B|Program Shareware 1\.0\.2|PSurf15a 11|PSurf15a 51|PSurf15a VA|psycheclone|RSurf15a 41|RSurf15a 51|RSurf15a 81|searchbot admin@google\.com|ShablastBot 1\.0|snap\.com beta crawler v0|Snapbot/1\.0|Snapbot/1\.0 \(Snap Shots, \+http\://www\.snap\.com\)|sogou develop spider|Sogou Orion spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sogou spider|Sogou web spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sohu agent|SSurf15a 11 |TSurf15a 11|Under the Rainbow 2\.2|User-Agent\: Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1\)|VadixBot|WebVulnCrawl\.unknown/1\.0 libwww-perl/5\.803|Wells Search II|WEP Search 00
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*"(?:%(badbots)s|%(badbotscustom)s)"$
ignoreregex =
datepattern = ^[^\[]*\[({DATE})
              {^LN-BEG}


I've had a robot crawling my site with status codes 200 and 301 for over half an hour, overloading my server. It's the same IP sending a dozen requests to many different pages within the same second.



Is it possible to limit this type of request by modifying the Fail2ban apache-badbot settings (is this the correct jail)?



Thank you for your answers!
 
You're not entirely wrong about this -

I had to modify the default failregex to get it to work correctly.
 
As a quick update to this post: the regex above is a bit too greedy, since it would effectively also match the URL itself. For example, a URL such as "/spot/hello-world/" would be picked up and banned if a "spot" bot was included in the definition list.

An updated version that fixes this would be:
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*.*" ".*(?:%(badbots)s|%(badbotscustom)s).*"$
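For reference, here is what a line this failregex is meant to catch looks like in a combined-format access log (the IP, date, and user agent below are made up for illustration). The <HOST> tag is where fail2ban picks up the client IP, and the badbots/badbotscustom alternation is now checked against the quoted user-agent field at the end of the line, so a bot name that only appears in the request path should no longer be enough to trigger a ban:

Code:
203.0.113.10 - - [10/May/2024:12:00:01 +0200] "GET /spot/hello-world/ HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; EmailCollector/2.0)"

This example line should be matched because EmailCollector is in the default badbotscustom list, while the /spot/ part of the URL by itself is ignored.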
 
I felt the regex was still rather greedy, so I did some testing to see if I could improve performance. Obviously, the more bad bots you have listed, the longer it takes fail2ban to run the regular expression. Still, the regular expression itself can severely impact performance, as I discovered during testing.

Code:
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*.*" ".*(?:%(badbots)s|%(badbotscustom)s).*"$
Result:
Lines: 38101 lines, 0 ignored, 45 matched, 38056 missed
[processed in 1274.94 sec]
Which is quite long, to be frank. So I tinkered with the regular expression and came up with this one, which seems to perform a lot better in terms of speed.

Code:
failregex = ^<HOST> -.*"(GET|POST|HEAD) \/.* HTTP\/\d(?:\.\d+)" \d+ \d+ ".*" ".*(?:%(badbots)s|%(badbotscustom)s).*"$
Result:
Lines: 35239 lines, 0 ignored, 45 matched, 35194 missed
[processed in 32.97 sec]

Your mileage may vary depending on the number of domains you have and the type of hardware you use.
 
I felt the regex was still rather greedy, so I did some testing to see if I could improve performance. Obviously, the more bad bots you have listed, the longer it takes fail2ban to run the regular expression. Still, the regular expression itself can severely impact performance, as I discovered during testing.

Code:
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*.*" ".*(?:%(badbots)s|%(badbotscustom)s).*"$
Result:

Which is quite long, to be frank. So I tinkered with the regular expression and came up with this one, which seems to perform a lot better in terms of speed.

Code:
failregex = ^<HOST> -.*"(GET|POST|HEAD) \/.* HTTP\/\d(?:\.\d+)" \d+ \d+ ".*(?:%(badbots)s|%(badbotscustom)s).*"$
Result:


Your mileage may vary depending on the number of domains you have and the type of hardware you use.

A great improvement on speed indeed!

Just one adjustment to help avoid matching the URL itself in requests:

failregex = ^<HOST> -.*"(GET|POST|HEAD) \/.* HTTP\/\d(?:\.\d+)" \d+ \d+ ".*" ".*(?:%(badbots)s|%(badbotscustom)s).*"$

Without this, if you were blocking a bot called "spot" and you had a page with a referer that contained "spotting-birds-in-the-wild", you could accidentally trigger a ban just by visiting that page.
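To make that concrete, here is a made-up log line for such a visitor (the IP, date, and referer are just examples):

Code:
198.51.100.4 - - [10/May/2024:12:00:02 +0200] "GET /photos/ HTTP/1.1" 200 4096 "https://example.com/spotting-birds-in-the-wild/" "Mozilla/5.0 (X11; Linux x86_64)"

With "spot" in the badbots list, the shorter regex can match this line because its final ".*(?:...).*"$ part is free to start matching at the referer's opening quote. The version with the extra ".*" group only looks for the bot name inside the final user-agent field, so this visitor should not be banned.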
 
Without this, if you were blocking a bot called "spot" and you had a page with a referer that contained "spotting-birds-in-the-wild", you could accidentally trigger a ban just by visiting that page.
You are absolutely right. Somehow I failed to incorporate that into the regex, even though you explicitly warned about this in your previous post.

I did some more testing today, as I wanted to get rid of the .* parts because those are quite heavy. I've replaced them with [^"]*, which is safer and seems (slightly) faster. I came up with this (which is safe for referer URLs too):
Code:
failregex = ^<HOST> -[^"]*"(?:GET|POST|HEAD) \/.* HTTP\/\d(?:\.\d+)" \d+ \d+ "[^"]*" "[^"]*(?:%(badbots)s|%(badbotscustom)s)[^"]*"$

Slowest result out of 1000 test runs:
Lines: 38629 lines, 0 ignored, 35 matched, 38594 missed
[processed in 28.78 sec]
Not bad :)
 
The new regex seems to be a big improvement. How did you come up with the changes you applied (how did you know, or where did you find out, that the obviously much longer, more complicated expression results in faster processing)? My hypothesis is that it is faster because a break condition is met earlier than in the original version, so the rest of the expression does not need to be evaluated. Could that be the case?
 
And now for part II -

CPU-consuming bots of the world, unite!

Use at your own risk, and note that some of the bots below are "legitimate" but blocked here because they generate irrelevant traffic:
badbots = ImagesiftBot|PetalBot|YandexBot|serpstatbot|GeedoProductSearch|Barkrowler|claudebot|SeekportBot|GPTBot|AmazonBot|Amazonbot|Bytespider|Bytedance|fidget-spinner-bot|EmailCollector|WebEMailExtrac|TrackBack/1\.02|sogou music spider|seocompany|LieBaoFast|SEOkicks|Cliqzbot|ssearch_bot|domaincrawler|AhrefsBot|spot|DigExt|Sogou|MegaIndex\.ru|majestic12|80legs|SISTRIX|HTTrack|Semrush|MJ12|Ezooms|CCBot|TalkTalk|Ahrefs|BLEXBot|Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|bwh3_user_agent|China Local Browse 2\.6|ContactBot/0\.2|ContentSmartz|DataCha0s/2\.0|DBrowse 1\.4b|DBrowse 1\.4d|Demo Bot DOT 16b|Demo Bot Z 16b|DSurf15a 01|DSurf15a 71|DSurf15a 81|DSurf15a VA|EBrowse 1\.4b|Educate Search VxB|EmailSiphon|EmailSpider|EmailWolf 1\.00|ESurf15a 15|ExtractorPro|Franklin Locator 1\.8|FSurf15a 01|Full Web Bot 0416B|Full Web Bot 0516B|Full Web Bot 2816B|Guestbook Auto Submitter|Industry Program 1\.0\.x|ISC Systems iRc Search 2\.1|IUPUI Research Bot v 1\.9a|LARBIN-EXPERIMENTAL \(efp@gmx\.net\)|LetsCrawl\.com/1\.0 \+http\://letscrawl\.com/|Lincoln State Web Browser|LMQueueBot/0\.2|LWP\:\:Simple/5\.803|Mac Finder 1\.0\.xx|MFC Foundation Class Library 4\.0|Microsoft URL Control - 6\.00\.8xxx|Missauga Locate 1\.0\.0|Missigua Locator 1\.9|Missouri College Browse|Mizzu Labs 2\.2|Mo College 1\.9|MVAClient|Mozilla/2\.0 \(compatible; NEWT ActiveX; Win32\)|Mozilla/3\.0 \(compatible; Indy Library\)|Mozilla/3\.0 \(compatible; scan4mail \(advanced version\) http\://www\.peterspages\.net/?scan4mail\)|Mozilla/4\.0 \(compatible; Advanced Email Extractor v2\.xx\)|Mozilla/4\.0 \(compatible; Iplexx Spider/1\.0 http\://www\.iplexx\.at\)|Mozilla/4\.0 \(compatible; MSIE 5\.0; Windows NT; DigExt; DTS Agent|Mozilla/4\.0 efp@gmx\.net|Mozilla/5\.0 \(Version\: xxxx Type\:xx\)|NameOfAgent \(CMS Spider\)|NASA Search 1\.0|Nsauditor/1\.x|PBrowse 1\.4b|PEval 1\.4b|Poirot|Port Huron Labs|Production Bot 0116B|Production Bot 2016B|Production Bot DOT 3016B|Program Shareware 1\.0\.2|PSurf15a 11|PSurf15a 51|PSurf15a VA|psycheclone|RSurf15a 41|RSurf15a 51|RSurf15a 81|searchbot admin@google\.com|ShablastBot 1\.0|snap\.com beta crawler v0|Snapbot/1\.0|Snapbot/1\.0 \(Snap Shots, \+http\://www\.snap\.com\)|sogou develop spider|Sogou Orion spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sogou spider|Sogou web spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sohu agent|SSurf15a 11 |TSurf15a 11|Under the Rainbow 2\.2|User-Agent\: Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1\)|VadixBot|WebVulnCrawl\.unknown/1\.0 libwww-perl/5\.803|Wells Search II|WEP Search 00
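If you would rather not edit the stock filter file directly, one option (a sketch; fail2ban reads a matching .local file on top of the .conf, and keys set there override the same keys in the .conf) is to put the list into /etc/fail2ban/filter.d/apache-badbots.local, for example:

Code:
# /etc/fail2ban/filter.d/apache-badbots.local
# Keys defined here override the same keys in apache-badbots.conf.
[Definition]
badbots = ImagesiftBot|PetalBot|YandexBot|serpstatbot|GPTBot|Bytespider
# (shortened here for readability; paste the full list from above on a single line)

Then reload fail2ban (for example with "fail2ban-client reload") so the jail picks up the change.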
 
The new regex seems to be a big improvement. How did you come up with the changes you applied (how did you know, or where did you find out, that the obviously much longer, more complicated expression results in faster processing)? My hypothesis is that it is faster because a break condition is met earlier than in the original version, so the rest of the expression does not need to be evaluated. Could that be the case?
Yes, exactly. Greedy expressions are more costly because they need to evaluate more. So it's better to narrow it down and make an expression follow as precise a pattern as possible. Also, some tokens are more efficient than others. For example, a non-capturing group is more efficient than a default (capturing) group.

Now, I don't know for sure that using .* is more expensive than using [^"]*, but it's at least a safer option (because it matches anything except a double quote); I found it in a pull request comment on the fail2ban repository.
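If you want to reproduce this kind of comparison yourself, a rough sketch (the log path is a placeholder, and a literal bot name such as AhrefsBot stands in for the %(badbots)s substitution, which is only expanded inside the filter file) is to time each variant with fail2ban-regex against the same log and compare the "[processed in ... sec]" line it prints at the end:

Code:
# greedy variant
fail2ban-regex /var/www/vhosts/example.com/logs/access_ssl_log '^<HOST> -.*"(GET|POST|HEAD).*HTTP.*.*" ".*(?:AhrefsBot).*"$'
# stricter variant
fail2ban-regex /var/www/vhosts/example.com/logs/access_ssl_log '^<HOST> -[^"]*"(?:GET|POST|HEAD) \/.* HTTP\/\d(?:\.\d+)" \d+ \d+ "[^"]*" "[^"]*(?:AhrefsBot)[^"]*"$'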
 
You are absolutely right. Somehow I failed to incorporate that into the regex, even though you explicitly warned about this in your previous post.

I did some more testing today, as I wanted to get rid of the .* parts because those are quite heavy. I've replaced them with [^"]*, which is safer and seems (slightly) faster. I came up with this (which is safe for referer URLs too):
Code:
failregex = ^<HOST> -[^"]*"(?:GET|POST|HEAD) \/.* HTTP\/\d(?:\.\d+)" \d+ \d+ "[^"]*" "[^"]*(?:%(badbots)s|%(badbotscustom)s)[^"]*"$

Slowest result out of 1000 test runs:

Not bad :)
Sorry for the noob question, but how can I test it like that on my server? I'm not very familiar with fail2ban-regex tests, but that result looks very promising.
 
Sorry for the noob question, but how can I test it like that on my server? I'm not very familiar with fail2ban-regex tests, but that result looks very promising.
No, that's actually a good question. Fail2ban has a built-in tool for testing fail2ban filters, called fail2ban-regex, which you can use from the command line, like this:
Code:
fail2ban-regex /var/www/vhosts/<your domain>/logs/access_ssl_log /etc/fail2ban/filter.d/apache-badbots.conf
This uses the /var/www/vhosts/<your domain>/logs/access_ssl_log log as the source to test the Apache bad bots filter located at /etc/fail2ban/filter.d/apache-badbots.conf. (Replace <your domain> with a domain on the server.)

If you want to change or improve your fail2ban filters, I highly recommend this blog post by @Peter Debik.
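Besides a whole log file, fail2ban-regex also accepts a single log line as the first argument, which is handy for quick checks. The line below is made up, and whether it is reported as matched depends on what is in your badbots list:

Code:
fail2ban-regex '203.0.113.5 - - [10/May/2024:12:00:03 +0200] "GET / HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"' /etc/fail2ban/filter.d/apache-badbots.conf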
 
You are absolutely right. Somehow I failed to incorporate that into the regex, even though you explicitly warned about this in your previous post.

I did some more testing today, as I wanted to get rid of the .* parts because those are quite heavy. I've replaced them with [^"]*, which is safer and seems (slightly) faster. I came up with this (which is safe for referer URLs too):
Code:
failregex = ^<HOST> -[^"]*"(?:GET|POST|HEAD) \/.* HTTP\/\d(?:\.\d+)" \d+ \d+ "[^"]*" "[^"]*(?:%(badbots)s|%(badbotscustom)s)[^"]*"$

Slowest result out of 1000 test runs:

Not bad :)

Hi everyone,

Thanks for this great post and all the useful input. It really helped me get more out of the fail2ban bad bot jail.
Nevertheless, since I started using the regex from Kaspar, it has actually been blocking almost all IPs... rather than just the bad bots.
Did you encounter the same? I'm just surprised by the outcome.

Thanks.
 
Thanks for this great post and all the useful input. It really helped me get more out of the fail2ban bad bot jail.
Nevertheless, since I started using the regex from Kaspar, it has actually been blocking almost all IPs... rather than just the bad bots.
Did you encounter the same? I'm just surprised by the outcome.
Hi, I've been using the regex on my production servers for about two months now without any negative effects.

The first thing that comes to mind that could cause something like this is an entry in your badbots (or badbotscustom) list that matches genuine user agents. Can you trace back in your domain logs which visitors got banned and with which user agents?
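A rough way to check this (a sketch; the jail name, log path, and IP are placeholders and may differ on your system) is to list the currently banned IPs for the jail and then pull the user-agent field, which is the sixth quote-delimited field in the combined log format, for one of those IPs from the access log:

Code:
fail2ban-client status plesk-apache-badbot
grep '203.0.113.5' /var/www/vhosts/example.com/logs/access_ssl_log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn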
 
Us too.

@Alban Staehli You can run "fail2ban-regex --verbose <log file> /etc/fail2ban/filter.d/apache-badbots.conf" to find out more details.

Thanks for the confirmation. The behavior I encountered has nothing to do with the failregex line.
If I use the failregex line with the badbots list from @pleskpanel, it works like a charm.

The problem comes from a custom list of badbots I was trying, which is a merge of an nginx regex, the one provided here, and a couple of additional bots.
This somehow gets almost ALL connections banned...
Code:
360Spider|80legs|adscanner|Ahrefs|AhrefsBot|Amazonbot|AmazonBot|ApacheBench|Aport|Applebot|archive|Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|BaiduBot|Baiduspider|Birubot|BLEXBot|bsalsa|BTWebClient|Butterfly|Buzzbot|BuzzSumo|bwh3_user_agent|Bytedance|Bytespider|CamontSpider|CCBot|China Local Browse 2\.6|ClaudeBot|Cliqzbot|CommentReader|ContactBot/0\.2|ContentSmartz|Copier|crawler|crazy|Crowsnest|curl|DataCha0s/2\.0|dataminr|DBrowse 1\.4b|DBrowse 1\.4d|Demo Bot DOT 16b|Demo Bot Z 16b|DeuSu|DigExt|Digincore|DISCo|discobot|Dispatch|domaincrawler|DomainSigma|DomainTools|DotBot|DSurf15a 01|DSurf15a 71|DSurf15a 81|DSurf15a VA|EBrowse 1\.4b|Educate Search VxB|EmailCollector|EmailSiphon|EmailSpider|EmailWolf 1\.00|Embedly|ESurf15a 15|Exabot|ExtractorPro|Ezooms|facebookexternalhit|FairShare|Faraday|FeedFetcher|fidget-spinner-bot|filterdb|FlaxCrawler|FlightDeckReportsBot|FlipboardProxy|Franklin Locator 1\.8|FSurf15a 01|Full Web Bot 0416B|Full Web Bot 0516B|Full Web Bot 2816B|FyberSpider|getintent|getprismatic|Gigabot|Go-http-client|GPTBot|GrapeshotCrawler|Guestbook Auto Submitter|help.jp|HTMLParser|HTTrack|hybrid|ia_archiver|igdeSpyder|Industry Program 1\.0\.x|InfoSeek|InternetSeer|ISC Systems iRc Search 2\.1|IUPUI Research Bot v 1\.9a|Jakarta|Java|Jooblebot|JS-Kit|km.ru|kmSearchBot|Kraken|larbin|LARBIN-EXPERIMENTAL \(efp@gmx\.net\)|Laserlikebot|Leikibot|LetsCrawl\.com/1\.0 \+http\://letscrawl\.com/|libwww|LieBaoFast|Lightspeedsystems|Lincoln State Web Browser|Linguee|LinkBot|linkdexbot|LinkExchanger|linkfluence|LinkpadBot|LivelapBot|LMQueueBot/0\.2|LoadImpactPageAnalyzer|ltx71|LWP\:\:Simple/5\.803|lwp-trivial|Mac Finder 1\.0\.xx|majestic|majestic12|masscan|meanpathbot|Mediatoolkitbot|MegaIndex|MegaIndex\.ru|MetaURI|MFC Foundation Class Library 4\.0|mfibot|Microsoft URL Control - 6\.00\.8xxx|Missauga Locate 1\.0\.0|Missigua Locator 1\.9|Missouri College Browse|Mizzu Labs 2\.2|MJ12|MJ12bot|MLBot|Mo College 1\.9|Mozilla/2\.0 \(compatible; NEWT ActiveX; Win32\)|Mozilla/3\.0 \(compatible; Indy Library\)|Mozilla/3\.0 \(compatible; scan4mail \(advanced version\) http\://www\.peterspages\.net/?scan4mail\)|Mozilla/4\.0 \(compatible; Advanced Email Extractor v2\.xx\)|Mozilla/4\.0 \(compatible; Iplexx Spider/1\.0 http\://www\.iplexx\.at\)|Mozilla/4\.0 \(compatible; MSIE 5\.0; Windows NT; DigExt; DTS Agent|Mozilla/4\.0 efp@gmx\.net|Mozilla/5\.0 \(Version\: xxxx Type\:xx\)|musobot|MVAClient|NameOfAgent \(CMS Spider\)|NASA Search 1\.0|NerdByNature|NetSeer|netvampire|NewShareCounts|NING|NjuiceBot|Nsauditor/1\.x|Nutch|Nuzzel|Offline|omgili|omgilibot|OpenHoseBot|openstat|OptimizationCrawler|Panopta|PaperLiBot|PBrowse 1\.4b|peerindex|PetalBot|PEval 1\.4b|pflab|pirst|Poirot|Port Huron Labs|postano|PostRank|Production Bot 0116B|Production Bot 2016B|Production Bot DOT 3016B|Program Shareware 1\.0\.2|proximic|PSurf15a 11|PSurf15a 51|PSurf15a VA|psycheclone|ptd-crawler|Purebot|PycURL|Python|python|QuerySeekerSpider|rogerbot|RSurf15a 41|RSurf15a 51|RSurf15a 81|Ruby|SafeSearch|Scrapy|SearchBot|searchbot admin@google\.com|SeekportBot|semantic|Semrush|SemrushBot|seocompany|SEOkicks|Seopult|SeznamBot|ShablastBot 1\.0|SISTRIX|SiteBot|Slurp|SMTBot|SMUrlExpander|snap\.com beta crawler v0|Snapbot/1\.0|Snapbot/1\.0 \(Snap Shots, \+http\://www\.snap\.com\)|SNAPSHOT|socialmediascanner|Sogou|sogou develop spider|sogou music spider|Sogou Orion spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sogou spider|Sogou web spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sohu agent|solomono|Soup|spbot|spot|spredbot|SputnikBot|ssearch_bot|SSurf15a 11 |statdom|StatOnlineRuBot|suggybot|Superfeedr|SurveyBot|SWeb|Tagoobot|TalkTalk|Teleport|TrackBack/1\.02|trendictionbot|TSearcher|TSurf15a 11|ttCrawler|TurnitinBot|TweetmemeBot|Twiceler|ubermetrics|Under the Rainbow 2\.2|UnwindFetchor|Uptimebot|urllib|User-Agent\: Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1\)|uTorrent|VadixBot|veoozbot|Voyager|WBSearchBot|WebCopier|WebEMailExtrac|WebVulnCrawl\.unknown/1\.0 libwww-perl/5\.803|Wells Search II|WEP Search 00|Wget|WordPress|woriobot|Yeti|YottosBot|Zeus|zitebot|ZmEu
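One way to sanity-check a merged list like this (a sketch; the IP, date, and user agent below are made up) is to run the filter against a log line from an ordinary browser and confirm that fail2ban-regex reports it as missed rather than matched. If a plain Chrome or Firefox line is matched, one of the entries in the list is too broad:

Code:
fail2ban-regex '203.0.113.7 - - [10/May/2024:12:00:04 +0200] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"' /etc/fail2ban/filter.d/apache-badbots.conf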
 