
Issue: Default plesk-apache-badbot fail2ban filter doesn't work

John41

New Pleskian
Server operating system version: Debian 11
Plesk version and microupdate number: v18.0.59

Hello,

I think I have a configuration problem with the Fail2ban apache-badbot filter, because I am unfortunately getting attacks of this type and Fail2ban does not ban any IP.

The default setting is:
[Definition]
badbotscustom = EmailCollector|WebEMailExtrac|TrackBack/1\.02|sogou music spider|(?:Mozilla/\d+\.\d+ )?Jorgee
badbots = Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|bwh3_user_agent|China Local Browse 2\.6|ContactBot/0\.2|ContentSmartz|DataCha0s/2\.0|DBrowse 1\.4b|DBrowse 1\.4d|Demo Bot DOT 16b|Demo Bot Z 16b|DSurf15a 01|DSurf15a 71|DSurf15a 81|DSurf15a VA|EBrowse 1\.4b|Educate Search VxB|EmailSiphon|EmailSpider|EmailWolf 1\.00|ESurf15a 15|ExtractorPro|Franklin Locator 1\.8|FSurf15a 01|Full Web Bot 0416B|Full Web Bot 0516B|Full Web Bot 2816B|Guestbook Auto Submitter|Industry Program 1\.0\.x|ISC Systems iRc Search 2\.1|IUPUI Research Bot v 1\.9a|LARBIN-EXPERIMENTAL \(efp@gmx\.net\)|LetsCrawl\.com/1\.0 \+http\://letscrawl\.com/|Lincoln State Web Browser|LMQueueBot/0\.2|LWP\:\:Simple/5\.803|Mac Finder 1\.0\.xx|MFC Foundation Class Library 4\.0|Microsoft URL Control - 6\.00\.8xxx|Missauga Locate 1\.0\.0|Missigua Locator 1\.9|Missouri College Browse|Mizzu Labs 2\.2|Mo College 1\.9|MVAClient|Mozilla/2\.0 \(compatible; NEWT ActiveX; Win32\)|Mozilla/3\.0 \(compatible; Indy Library\)|Mozilla/3\.0 \(compatible; scan4mail \(advanced version\) http\://www\.peterspages\.net/?scan4mail\)|Mozilla/4\.0 \(compatible; Advanced Email Extractor v2\.xx\)|Mozilla/4\.0 \(compatible; Iplexx Spider/1\.0 http\://www\.iplexx\.at\)|Mozilla/4\.0 \(compatible; MSIE 5\.0; Windows NT; DigExt; DTS Agent|Mozilla/4\.0 efp@gmx\.net|Mozilla/5\.0 \(Version\: xxxx Type\:xx\)|NameOfAgent \(CMS Spider\)|NASA Search 1\.0|Nsauditor/1\.x|PBrowse 1\.4b|PEval 1\.4b|Poirot|Port Huron Labs|Production Bot 0116B|Production Bot 2016B|Production Bot DOT 3016B|Program Shareware 1\.0\.2|PSurf15a 11|PSurf15a 51|PSurf15a VA|psycheclone|RSurf15a 41|RSurf15a 51|RSurf15a 81|searchbot admin@google\.com|ShablastBot 1\.0|snap\.com beta crawler v0|Snapbot/1\.0|Snapbot/1\.0 \(Snap Shots, \+http\://www\.snap\.com\)|sogou develop spider|Sogou Orion spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sogou spider|Sogou web spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sohu agent|SSurf15a 11 |TSurf15a 11|Under the Rainbow 2\.2|User-Agent\: Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1\)|VadixBot|WebVulnCrawl\.unknown/1\.0 libwww-perl/5\.803|Wells Search II|WEP Search 00
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*"(?:%(badbots)s|%(badbotscustom)s)"$
ignoreregex =
datepattern = ^[^\[]*\[({DATE})
              {^LN-BEG}


I've had a robot crawling my site with codes 200 and 301 for over half an hour, overloading my server. It's the same IP sending a dozen requests to many different pages in the same second.



Is it possible to limit this type of request by modifying the Fail2ban apache-badbot settings (is this the correct jail?)?



Thank you for your answers!
 
You're not entirely incorrect on this -

I had to modify the failregex of the default filter for it to work correctly.
 
As a quick update to this post: the regex above is a bit too greedy, since it would effectively match against the URL itself. For example, a URL such as "/spot/hello-world/" would be picked up and banned if the "spot" bot was included in the definition list.

An updated version that fixes this would be:
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*.*" ".*(?:%(badbots)s|%(badbotscustom)s).*"$
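If it helps anyone: rather than editing the packaged filter, a change like this can also go into a .local override, since fail2ban reads apache-badbots.local on top of apache-badbots.conf. A sketch (as far as I know, the badbots/badbotscustom lists defined in the .conf stay available to the override):
Code:
# /etc/fail2ban/filter.d/apache-badbots.local
# overrides only failregex; badbots and badbotscustom keep their values from apache-badbots.conf
[Definition]
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*.*" ".*(?:%(badbots)s|%(badbotscustom)s).*"$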
 
I felt the regex was still rather greedy, so I did some testing to see if I could improve performance. Obviously, the more bad bots you have listed, the longer it takes fail2ban to run the regular expression, but the regular expression itself can also severely impact performance, as I discovered during testing.

Code:
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*.*" ".*(?:%(badbots)s|%(badbotscustom)s).*"$
Result:
Lines: 38101 lines, 0 ignored, 45 matched, 38056 missed
[processed in 1274.94 sec]
Which is quite long, to be frank. So I tinkered with the regular expression and came up with this one, which seems to perform a lot better in terms of speed.

Code:
failregex = ^<HOST> -.*"(GET|POST|HEAD) \/.* HTTP\/\d(?:\.\d+)" \d+ \d+ ".*" ".*(?:%(badbots)s|%(badbotscustom)s).*"$
Result:
Lines: 35239 lines, 0 ignored, 45 matched, 35194 missed
[processed in 32.97 sec]
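For reference, the tighter pattern just follows the field layout of a combined-format access log line. A made-up example (every value here is invented):
Code:
203.0.113.7 - - [12/Apr/2024:10:15:32 +0200] "GET /blog/some-page/ HTTP/1.1" 200 8421 "https://example.com/" "Mozilla/5.0 (compatible; SomeBadBot/1.0)"
The \d+ \d+ part lines up with the status code and response size, and the two quoted blocks after that are the referer and the user agent, so the expression can give up early on lines that don't fit this shape.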

Your mileage may vary depending on the number of domains you have and the type of hardware you use.
 
So I tinkered with the regular expression and came up with this one, which seems to perform a lot better in terms of speed.

Code:
failregex = ^<HOST> -.*"(GET|POST|HEAD) \/.* HTTP\/\d(?:\.\d+)" \d+ \d+ ".*(?:%(badbots)s|%(badbotscustom)s).*"$

A great improvement on speed indeed!

Just one adjustment, to help avoid matching the URL itself in requests:

failregex = ^<HOST> -.*"(GET|POST|HEAD) \/.* HTTP\/\d(?:\.\d+)" \d+ \d+ ".*" ".*(?:%(badbots)s|%(badbotscustom)s).*"$

Without this, if you were blocking a bot called "spot" and you had a page with a referer that contained "spotting-birds-in-the-wild", you could accidentally trigger a ban just by visiting the page.
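To make that concrete, here is a made-up line that would be a false positive without the extra ".*" field, assuming "spot" were on the bot list:
Code:
198.51.100.23 - - [12/Apr/2024:11:02:10 +0200] "GET /photos/ HTTP/1.1" 200 5120 "https://example.com/spotting-birds-in-the-wild/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
With only one quoted block at the end of the pattern, "spot" can be found inside the referer; with the extra ".*" field in front of it, the bot names are effectively only tested against the final user-agent field.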
 
Without this, if you were blocking a bot called "spot" and you had a page with a referer that contained "spotting-birds-in-the-wild", you could accidentally trigger a ban just by visiting the page.
You are absolutely right. Somehow I failed to incorporate that into the regex, even after you explicitly warned about this in your previous post.

I did some more testing today, as I wanted to get rid of the .* parts because those are quite heavy. I've replaced them with [^"]*, which is safer and seems (slightly) faster. I came up with this (which is safe for referer URLs too):
Code:
failregex = ^<HOST> -[^"]*"(?:GET|POST|HEAD) \/.* HTTP\/\d(?:\.\d+)" \d+ \d+ "[^"]*" "[^"]*(?:%(badbots)s|%(badbotscustom)s)[^"]*"$

Slowest result out of 1000 test runs:
Lines: 38629 lines, 0 ignored, 35 matched, 38594 missed
[processed in 28.78 sec]
Not bad :)
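If anyone wants to try this: after changing the filter you have to make fail2ban reload it, and you can then check the jail. Something like the following (the jail name is an assumption; on Plesk it is usually plesk-apache-badbot, and a plain fail2ban-client status lists the jails on your box):
Code:
fail2ban-client reload
fail2ban-client status plesk-apache-badbot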
 
The new regex seems to be a big improvement. How did you come up with the changes you applied (how did you know, or where did you find out, that the obviously much longer, more complicated expression results in faster processing)? My hypothesis is that it is faster because a break condition is met earlier than in the original version, so that the rest of the expression does not need to be evaluated. Could that be the case?
 
And now for part II:

The CPU-consuming bots of the world, unite!

Use at your own risk, and note that some of the bots below are "legitimate" but are blocked because their traffic is irrelevant:
badbots = ImagesiftBot|PetalBot|YandexBot|serpstatbot|GeedoProductSearch|Barkrowler|claudebot|SeekportBot|GPTBot|AmazonBot|Amazonbot|Bytespider|Bytedance|fidget-spinner-bot|EmailCollector|WebEMailExtrac|TrackBack/1\.02|sogou music spider|seocompany|LieBaoFast|SEOkicks|Cliqzbot|ssearch_bot|domaincrawler|AhrefsBot|spot|DigExt|Sogou|MegaIndex\.ru|majestic12|80legs|SISTRIX|HTTrack|Semrush|MJ12|Ezooms|CCBot|TalkTalk|Ahrefs|BLEXBot|Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|bwh3_user_agent|China Local Browse 2\.6|ContactBot/0\.2|ContentSmartz|DataCha0s/2\.0|DBrowse 1\.4b|DBrowse 1\.4d|Demo Bot DOT 16b|Demo Bot Z 16b|DSurf15a 01|DSurf15a 71|DSurf15a 81|DSurf15a VA|EBrowse 1\.4b|Educate Search VxB|EmailSiphon|EmailSpider|EmailWolf 1\.00|ESurf15a 15|ExtractorPro|Franklin Locator 1\.8|FSurf15a 01|Full Web Bot 0416B|Full Web Bot 0516B|Full Web Bot 2816B|Guestbook Auto Submitter|Industry Program 1\.0\.x|ISC Systems iRc Search 2\.1|IUPUI Research Bot v 1\.9a|LARBIN-EXPERIMENTAL \(efp@gmx\.net\)|LetsCrawl\.com/1\.0 \+http\://letscrawl\.com/|Lincoln State Web Browser|LMQueueBot/0\.2|LWP\:\:Simple/5\.803|Mac Finder 1\.0\.xx|MFC Foundation Class Library 4\.0|Microsoft URL Control - 6\.00\.8xxx|Missauga Locate 1\.0\.0|Missigua Locator 1\.9|Missouri College Browse|Mizzu Labs 2\.2|Mo College 1\.9|MVAClient|Mozilla/2\.0 \(compatible; NEWT ActiveX; Win32\)|Mozilla/3\.0 \(compatible; Indy Library\)|Mozilla/3\.0 \(compatible; scan4mail \(advanced version\) http\://www\.peterspages\.net/?scan4mail\)|Mozilla/4\.0 \(compatible; Advanced Email Extractor v2\.xx\)|Mozilla/4\.0 \(compatible; Iplexx Spider/1\.0 http\://www\.iplexx\.at\)|Mozilla/4\.0 \(compatible; MSIE 5\.0; Windows NT; DigExt; DTS Agent|Mozilla/4\.0 efp@gmx\.net|Mozilla/5\.0 \(Version\: xxxx Type\:xx\)|NameOfAgent \(CMS Spider\)|NASA Search 1\.0|Nsauditor/1\.x|PBrowse 1\.4b|PEval 1\.4b|Poirot|Port Huron Labs|Production Bot 0116B|Production Bot 2016B|Production Bot DOT 3016B|Program Shareware 1\.0\.2|PSurf15a 11|PSurf15a 51|PSurf15a VA|psycheclone|RSurf15a 41|RSurf15a 51|RSurf15a 81|searchbot admin@google\.com|ShablastBot 1\.0|snap\.com beta crawler v0|Snapbot/1\.0|Snapbot/1\.0 \(Snap Shots, \+http\://www\.snap\.com\)|sogou develop spider|Sogou Orion spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sogou spider|Sogou web spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sohu agent|SSurf15a 11 |TSurf15a 11|Under the Rainbow 2\.2|User-Agent\: Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1\)|VadixBot|WebVulnCrawl\.unknown/1\.0 libwww-perl/5\.803|Wells Search II|WEP Search 00
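Side note: the stock failregex already references %(badbotscustom)s, so an alternative is to leave the shipped badbots line alone and put your additions into badbotscustom instead. Just a sketch, pick your own additions:
Code:
badbotscustom = EmailCollector|WebEMailExtrac|TrackBack/1\.02|sogou music spider|(?:Mozilla/\d+\.\d+ )?Jorgee|ImagesiftBot|PetalBot|YandexBot|serpstatbot|GPTBot|Amazonbot|Bytespider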
 
The new regex seems to be a big improvement. How did you come up with the changes you applied (how did you know, or where did you find out, that the obviously much longer, more complicated expression results in faster processing)? My hypothesis is that it is faster because a break condition is met earlier than in the original version, so that the rest of the expression does not need to be evaluated. Could that be the case?
Yes, exactly. Greedy expressions are more costly because they have to evaluate more. So it's better to narrow the pattern down and make the expression as precise as possible. Also, some tokens are more efficient than others; for example, a non-capturing group is more efficient than a default (capturing) group.

Now, I don't know for sure that using .* is more expensive than using [^"]*, but it's at least a safer option (because it allows all characters except a double quote), which I found in a pull request comment on the fail2ban repository.
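To make that a bit more concrete, a small side-by-side of the tokens involved (with a literal bot name instead of the %(badbots)s list, purely as an illustration):
Code:
# capturing group: the engine has to record what matched
(GET|POST|HEAD)
# non-capturing group: same alternation, nothing is recorded
(?:GET|POST|HEAD)

# .* can run past the closing quote and backtrack across field boundaries
".*spot.*"
# [^"]* can never cross a quote, so "spot" is only searched for inside a single quoted field
"[^"]*spot[^"]*"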
 
I came up with this (which is safe for referer URLs too):
Code:
failregex = ^<HOST> -[^"]*"(?:GET|POST|HEAD) \/.* HTTP\/\d(?:\.\d+)" \d+ \d+ "[^"]*" "[^"]*(?:%(badbots)s|%(badbotscustom)s)[^"]*"$

Slowest result out of 1000 test runs:
Lines: 38629 lines, 0 ignored, 35 matched, 38594 missed
[processed in 28.78 sec]
Not bad :)
Sorry for the noob question, but how can I test it like that on my server? I'm not very familiar with fail2ban-regex tests, but that result looks very promising.
 
Sorry for the noob question, but how can I test it like that on my server? I'm not very familiar with fail2ban-regex tests, but that result looks very promising.
No, that's actually a good question. Fail2ban has a built-in tool for testing fail2ban filters, called fail2ban-regex, which you can use from the command line. Like:
Code:
fail2ban-regex /var/www/vhosts/<your domain>/logs/access_ssl_log /etc/fail2ban/filter.d/apache-badbots.conf
This uses the /var/www/vhosts/<your domain>/logs/access_ssl_log log as the source to test the Apache bad bots filter located at /etc/fail2ban/filter.d/apache-badbots.conf. (Replace <your domain> with a domain on the server.)
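Two more things that can be handy while experimenting (example.com is just a placeholder here): fail2ban-regex also accepts a regular expression directly instead of a filter file; the %(badbots)s substitution only works inside a filter file, so in that case you paste a literal bot name. There are also flags to print the matched or missed lines:
Code:
# test a single expression against a log, with a literal bot name instead of %(badbots)s
fail2ban-regex /var/www/vhosts/example.com/logs/access_ssl_log '^<HOST> -[^"]*"(?:GET|POST|HEAD) \/.* HTTP\/\d(?:\.\d+)" \d+ \d+ "[^"]*" "[^"]*(?:PetalBot|Bytespider)[^"]*"$'

# print every matched line while tuning a filter
fail2ban-regex --print-all-matched /var/www/vhosts/example.com/logs/access_ssl_log /etc/fail2ban/filter.d/apache-badbots.conf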

If you want to change/improve your fail2ban filters, I highly recommend this blog post by @Peter Debik.
 