
Question: Block MJ12bot with Plesk Fail2Ban plesk-apache-badbot Filter

WebHostingAce

Silver Pleskian
Hi,

How can I block MJ12bot with Plesk Fail2Ban plesk-apache-badbot Filter?

In access_ssl_log I see:
Code:
163.172.68.121 - - [16/Jan/2017:19:36:49 +1100] "GET /wishlist/index/add/product/217/form_key/yAOYefomEzibxwE4/ HTTP/1.0" 302 1038 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.7; http://mj12bot.com/)"

I tried this in the apache-badbots filter:
Code:
[Definition]
badbotscustom = EmailCollector|WebEMailExtrac|TrackBack/1\.02|sogou music spider|MJ12Bot
badbots = Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|bwh3_user_agent|China Local Browse 2\.6|ContactBot/0\.2|ContentSmartz|DataCha0s/2\.0|DBrowse 1\.4b|DBrowse 1\.4d|Demo Bot DOT 16b|Demo Bot Z 16b|DSurf15a 01|DSurf15a 71|DSurf15a 81|DSurf15a VA|EBrowse 1\.4b|Educate Search VxB|EmailSiphon|EmailSpider|EmailWolf 1\.00|ESurf15a 15|ExtractorPro|Franklin Locator 1\.8|FSurf15a 01|Full Web Bot 0416B|Full Web Bot 0516B|Full Web Bot 2816B|Guestbook Auto Submitter|Industry Program 1\.0\.x|ISC Systems iRc Search 2\.1|IUPUI Research Bot v 1\.9a|LARBIN-EXPERIMENTAL \(efp@gmx\.net\)|LetsCrawl\.com/1\.0 +http\://letscrawl\.com/|Lincoln State Web Browser|LMQueueBot/0\.2|LWP\:\:Simple/5\.803|Mac Finder 1\.0\.xx|MFC Foundation Class Library 4\.0|Microsoft URL Control - 6\.00\.8xxx|Missauga Locate 1\.0\.0|Missigua Locator 1\.9|Missouri College Browse|Mizzu Labs 2\.2|Mo College 1\.9|MVAClient|Mozilla/2\.0 \(compatible; NEWT ActiveX; Win32\)|Mozilla/3\.0 \(compatible; Indy Library\)|Mozilla/3\.0 \(compatible; scan4mail \(advanced version\) http\://www\.peterspages\.net/?scan4mail\)|Mozilla/4\.0 \(compatible; Advanced Email Extractor v2\.xx\)|Mozilla/4\.0 \(compatible; Iplexx Spider/1\.0 http\://www\.iplexx\.at\)|Mozilla/4\.0 \(compatible; MSIE 5\.0; Windows NT; DigExt; DTS Agent|Mozilla/4\.0 efp@gmx\.net|Mozilla/5\.0 \(Version\: xxxx Type\:xx\)|NameOfAgent \(CMS Spider\)|NASA Search 1\.0|Nsauditor/1\.x|PBrowse 1\.4b|PEval 1\.4b|Poirot|Port Huron Labs|Production Bot 0116B|Production Bot 2016B|Production Bot DOT 3016B|Program Shareware 1\.0\.2|PSurf15a 11|PSurf15a 51|PSurf15a VA|psycheclone|RSurf15a 41|RSurf15a 51|RSurf15a 81|searchbot admin@google\.com|ShablastBot 1\.0|snap\.com beta crawler v0|Snapbot/1\.0|Snapbot/1\.0 \(Snap Shots, +http\://www\.snap\.com\)|sogou develop spider|Sogou Orion spider/3\.0\(+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sogou spider|Sogou web spider/3\.0\(+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sohu agent|SSurf15a 11 |TSurf15a 11|Under the Rainbow 2\.2|User-Agent\: Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1\)|VadixBot|WebVulnCrawl\.unknown/1\.0 libwww-perl/5\.803|Wells Search II|WEP Search 00
failregex = ^<HOST> -.*"(GET|POST).*HTTP.*"(?:%(badbots)s|%(badbotscustom)s)"$
ignoreregex =

Can't get it to work. Does anyone know how to do it?

Thank you.
 
For the fail2ban rule, add this definition to the end of the "badbots" line of /etc/fail2ban/filter.d/apache-badbots.conf:
Code:
|Mozilla/5\.0 \(compatible; MJ12bot/v1\.4\.7; http\://mj12bot\.com/\)
It should match the MJ12bot user agent shown in your /wishlist... example.
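To verify the change before it goes live, you can test the filter against the log file directly with fail2ban-regex (the log path below is only an example, adjust it to your domain) and then restart fail2ban so the new pattern is picked up:
Code:
fail2ban-regex /var/www/vhosts/system/example.com/logs/access_ssl_log /etc/fail2ban/filter.d/apache-badbots.conf
The output reports how many log lines matched the failregex; if the MJ12bot request shows up as a match, the entry is correct.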

But when the version number or anything else in that string changes, it will no longer match. There might be a better approach that covers a much wider range and is easier to handle: simply blocking the bots with .htaccess rewrite rules. The advantages are the much easier handling and that fail2ban does not need to scan the logs at all. Because the server saves a lot of CPU time on reading and analyzing logs, the .htaccess approach might even be faster and save more CPU time than blocking IP addresses via iptables (fail2ban needs a lot of computing power just to identify the problematic IP addresses in the first place). An example of how such entries in an .htaccess file could look for the most common bad bots:

Code:
RewriteEngine On
RewriteOptions inherit
RewriteCond %{HTTP_USER_AGENT} (AhrefsBot|spbot|DigExt|Sogou) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (MegaIndex.ru|majestic12|80legs|SISTRIX|HTTrack|Semrush) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (MJ12|MJ12bot|MJ12Bot|Ezooms|CCBot|TalkTalk|Ahrefs) [NC]
RewriteRule .* - [F]
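A quick way to check that the rules are active is to send a request with one of the blocked user agents and confirm that a 403 comes back (example.com stands in for your own domain here):
Code:
curl -I -A "Mozilla/5.0 (compatible; MJ12bot/v1.4.7; http://mj12bot.com/)" http://example.com/
A request with a normal browser user agent should still return 200, while the spoofed bot user agent is answered with 403 Forbidden by the [F] flag.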
 
Hi @Peter Debik ,
I agree that fail2ban may not be the best tool to block bots, but what about iptables? For example: sudo iptables -I INPUT -p TCP -m string --string "Baiduspider" --algo bm -j DROP
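For reference, the same idea extended to several user agent strings looks like this (just a sketch, and it only works for plain HTTP on port 80, because the string match cannot look inside encrypted HTTPS traffic):
Code:
# Drop packets whose payload contains one of these user agent substrings (HTTP only)
for BOT in MJ12bot AhrefsBot SemrushBot Baiduspider; do
    iptables -I INPUT -p tcp --dport 80 -m string --string "$BOT" --algo bm -j DROP
done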

With iptables, spam bot traffic doesn't even reach nginx and apache. There is obviously a performance penalty for checking the user agent of every request at the firewall level... I am now trying your approach for comparison with the iptables one (to evaluate the effect on page loading times), though instead of using .htaccess I included these as Apache directives for HTTP and HTTPS. One obvious drawback is that I am now getting a lot of records from bots in the access log and, most importantly, I started getting this message in the error log for every blocked bot hit: AH00124: Request exceeded the limit of 10 internal redirects due to probable configuration error. Use 'LimitInternalRecursion' to increase the limit if necessary. Use 'LogLevel debug' to get a backtrace.
Maybe the errors have something to do with my configuration: I run nginx as a proxy for apache...?
 
The problem with iptables can be that it slows down each and every packet that is processed at the network interface. It does so even if no conditions are met and no traffic is filtered. But yes, it is of course possible to drop the traffic before it reaches the web servers. It's probably about balancing things out. Here we don't see so many bad bot requests that it would be worthwhile to slow everything down to catch them in advance, but there can be situations where it is better to block the traffic right away.
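If someone wants to use the iptables string match anyway, the per-packet cost can at least be limited; a rough sketch (assuming connection tracking is active) that restricts the inspection to port 80 and to the first few packets of each connection, where the request headers are:
Code:
# Only inspect the first packets of each HTTP connection instead of every packet
iptables -I INPUT -p tcp --dport 80 -m connbytes --connbytes 1:6 --connbytes-dir original --connbytes-mode packets -m string --string "MJ12bot" --algo bm -j DROP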
 
@Peter Debik , re "Here we don't see that many bad bots requests": you are lucky, on my site/VPS spam bots/crawlers generate almost 40% of the traffic :(
Re "every single packet that is processed at the network interface": you are right, it's all about the proportion of bad packets (BP) to good packets (GP). Here is what happens:

With iptables blocking:
Network Interface: Must inspect GP and BP, drops all BP
Nginx/Apache: Receives GP only, writes logs for GP only

With .htaccess blocking:
Network Interface: N/A
Nginx/Apache: Receives GP and BP, writes logs for GP and BP, drops all BP

So the question is a trade-off between inspecting all packets at the network interface vs. processing extra requests at the Nginx/Apache level plus extra logging (every request from "bad bots" will be logged in the nginx and apache access logs, which is not the case when blocking at the iptables level). To answer this question I decided to do some comparison testing. So far I have 2 samples of 500+ visits each from the same metro area, and the page loading time is about 10% lower when blocking bots with iptables. Though, the difference may have something to do with the recursion error mentioned in my previous post... I still cannot figure out what's wrong there.
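For anyone who wants to repeat this kind of comparison, a simple way to sample page loading times from the command line is curl's timing variables (the URL is just a placeholder; repeat the call a number of times and average the results):
Code:
curl -o /dev/null -s -w "connect: %{time_connect}s  ttfb: %{time_starttransfer}s  total: %{time_total}s\n" https://example.com/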

Mike
 
I also recommend handling this at the firewall level, as it can really decrease the load on the server and it helps your clients because their web stats are no longer poisoned with all this fake traffic.

You would be surprised at how much traffic these bots generate. In Juggernaut Firewall we have a bad-bot trigger that will catch these bad bots and block them right away. You can also tell it to block bad subnets from bots that repeatedly come from the same netblock.

For some of our larger clients I actually pull bad bot IPs directly from their access logs and then generate a blocklist based on that. The good thing is that Juggernaut blocklists can use ipset, so there is almost no slowdown even when blocking thousands of IP addresses.

Example
Code:
zcat -f /var/www/vhosts/system/*/statistics/logs/access_*log /var/www/vhosts/system/*/statistics/logs/access_*log.processed* | awk -F' - |\\"' '{print $1, $7}' | grep -i '360Spider\|80legs\|Acunetix\|AhrefsBot\|aiHitBot\|BackDoorBot\|Bandit\|Baiduspider\|DotBot\|Exabot\|FHscan\|Havij\|HTTrack\|MJ12bot\|moget\|Nutch\|ichiro\|RedBot\|SemrushBot\|SeznamBot\|Sogou\|Sosospider\|spbot\|WebZIP\|XoviBot\|Xenu\|Yandex\|Yeti\|YisouSpider\|Zeus' | awk '{print $1}' | sort -n -t . -k 1,1 -k 2,2 -k 3,3 -k 4,4 | uniq > bad_bots.txt

You can then optimize that file with iprange to get a compact bad-bot blocklist that all your servers can pull at regular intervals.
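If you are not using Juggernaut, the generated list can also be loaded by hand; a minimal sketch, assuming ipset is installed and bad_bots.txt contains one IP or CIDR range per line:
Code:
# Create the set once, refill it from the generated list, and hook it into iptables
ipset create badbots hash:net -exist
while read -r ip; do ipset add badbots "$ip" -exist; done < bad_bots.txt
iptables -C INPUT -m set --match-set badbots src -j DROP 2>/dev/null || iptables -I INPUT -m set --match-set badbots src -j DROP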
 
@danami , I used to block IP ranges with iptables but gave up on this because it was slowing the site down significantly (no, I didn't use ipsets, so that may be the reason). I am going to have a close look at your Juggernaut Firewall, but my main problem at the moment (and the reason why I am evaluating .htaccess blocking even though it seems to be less efficient) is HTTPS: my VPS runs my business site and I plan to switch it to HTTPS only. Unfortunately, with HTTPS I can no longer stop bad bots at the firewall level with a user agent string match :(
 
@danami
Unfortunately, with HTTPS I can no longer stop bad bots at firewall level using user agent string match :(

Juggernaut's login failure daemon constantly monitors your web server access logs, so it works for both HTTP and HTTPS and still blocks these bots. The messenger service also supports HTTPS, so you can show blocked users a message over HTTPS telling them that they are being blocked, and it parses and uses all of your web servers' SSL certificates so they won't get a certificate mismatch error when they are redirected.
 