Question: Rate Limiting

John41

New Pleskian
Server operating system version: Debian 11
Plesk version and microupdate number: 18.0.59
Hello,

Yesterday, several robots crawled my site with codes 200 and 301, overloading my server. The same IP sent a dozen requests to many different pages in the same second, for over an hour.

Is it possible in Plesk to limit the number of connections per second, i.e. to apply rate limiting?

Thank you!
 
I've seen an incredible increase in such robot visits recently, especially from Amazon AWS instances. Traffic went up to about 12 times the usual level, measured across a dozen machines, so it's probably a general issue, as it does not target specific IPs or domains. You can try to block the traffic with the methods explained in How to Avoid High CPU Load & Block Bad Bots with Plesk. I recently added some extra rules, not mentioned in the article, that resulted from analysis of ongoing bad bot visits:

Code:
^<HOST> .*"GET /(aa/|ss/|rr/|ig/|in/|be/|go/|sidwso\.php|filefuns\.php|gel4y\.php|\.tmb/admin.php|access\.php|wp-admin/includes/xmrlpc\.php|\.well-known/pki-validation/cloud\.php|inicio-sesion\.php|admin-post\.php|notip\.html|images/pt_logo\.svg|images/process\.jpg|pl/payu/pay\.php|san_filez/img/alert\.svg|files/img/blank\.gif|merchantbank/pageBank/bank).*" 404.*

Besides that, you can also check whether the visits are coming from the same subnet, and if they amount to hundreds or thousands, block the whole subnet with a suitable iptables rule. For example, when you observe a very high number of Fail2Ban bans for a specific subnet like 47.128...., you could run
iptables -I INPUT 1 -s 47.128.0.0/16 -j DROP
But make sure that nothing important, such as Let's Encrypt servers or other endpoints your system needs to contact, is located in that subnet.
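To see whether a single range really dominates before you block it, something like this gives a quick per-/16 count of Fail2Ban bans (the log path may differ on your system; /var/log/fail2ban.log is the Debian default):

Code:
# Count Fail2Ban "Ban" entries per /16 subnet, most frequent first
grep 'Ban ' /var/log/fail2ban.log | awk '{print $NF}' \
  | awk -F. '{print $1"."$2".0.0/16"}' | sort | uniq -c | sort -rn | head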

Then you could take a look at mod_evasive, an Apache module that can block excessive traffic from the same source. I do not recommend it, though, because it comes with issues.
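If you do want to experiment with it anyway, the usual directives look roughly like this; the values are purely illustrative and the file path depends on your distribution:

Code:
# e.g. /etc/apache2/mods-available/evasive.conf -- illustrative values only
<IfModule mod_evasive20.c>
    DOSHashTableSize    3097
    DOSPageCount        10      # max requests for the same page per interval
    DOSPageInterval     1       # page interval in seconds
    DOSSiteCount        50      # max requests for the whole site per interval
    DOSSiteInterval     1       # site interval in seconds
    DOSBlockingPeriod   60      # seconds the offending IP stays blocked
</IfModule>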
 
Thank you for your reply.

I have read your article and modified the Fail2ban apache-badbots filter as follows, based on the article and the clarification in your previous message:
[Definition]
badbotscustom = thesis-research-bot
badbots = GPTBot|AmazonBot|Bytespider|Bytedance|fidget-spinner-bot|EmailCollector|WebEMailExtrac|TrackBack/1\.02|sogou music spider|seocompany|LieBaoFast|SEOkicks|Uptimebot|Cliqzbot|ssearch_bot|domaincrawler|AhrefsBot|spot|DigExt|Sogou|MegaIndex\.ru|majestic12|80legs|SISTRIX|HTTrack|Semrush|MJ12|Ezooms|CCBot|TalkTalk|Ahrefs|BLEXBot|Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|bwh3_user_agent|China Local Browse 2\.6|ContactBot/0\.2|ContentSmartz|DataCha0s/2\.0|DBrowse 1\.4b|DBrowse 1\.4d|Demo Bot DOT 16b|Demo Bot Z 16b|DSurf15a 01|DSurf15a 71|DSurf15a 81|DSurf15a VA|EBrowse 1\.4b|Educate Search VxB|EmailSiphon|EmailSpider|EmailWolf 1\.00|ESurf15a 15|ExtractorPro|Franklin Locator 1\.8|FSurf15a 01|Full Web Bot 0416B|Full Web Bot 0516B|Full Web Bot 2816B|Guestbook Auto Submitter|Industry Program 1\.0\.x|ISC Systems iRc Search 2\.1|IUPUI Research Bot v 1\.9a|LARBIN-EXPERIMENTAL \(efp@gmx\.net\)|LetsCrawl\.com/1\.0 \+http\://letscrawl\.com/|Lincoln State Web Browser|LMQueueBot/0\.2|LWP\:\:Simple/5\.803|Mac Finder 1\.0\.xx|MFC Foundation Class Library 4\.0|Microsoft URL Control - 6\.00\.8xxx|Missauga Locate 1\.0\.0|Missigua Locator 1\.9|Missouri College Browse|Mizzu Labs 2\.2|Mo College 1\.9|MVAClient|Mozilla/2\.0 \(compatible; NEWT ActiveX; Win32\)|Mozilla/3\.0 \(compatible; Indy Library\)|Mozilla/3\.0 \(compatible; scan4mail \(advanced version\) http\://www\.peterspages\.net/?scan4mail\)|Mozilla/4\.0 \(compatible; Advanced Email Extractor v2\.xx\)|Mozilla/4\.0 \(compatible; Iplexx Spider/1\.0 http\://www\.iplexx\.at\)|Mozilla/4\.0 \(compatible; MSIE 5\.0; Windows NT; DigExt; DTS Agent|Mozilla/4\.0 efp@gmx\.net|Mozilla/5\.0 \(Version\: xxxx Type\:xx\)|NameOfAgent \(CMS Spider\)|NASA Search 1\.0|Nsauditor/1\.x|PBrowse 1\.4b|PEval 1\.4b|Poirot|Port Huron Labs|Production Bot 0116B|Production Bot 2016B|Production Bot DOT 3016B|Program Shareware 1\.0\.2|PSurf15a 11|PSurf15a 51|PSurf15a VA|psycheclone|RSurf15a 41|RSurf15a 51|RSurf15a 81|searchbot admin@google\.com|ShablastBot 1\.0|snap\.com beta crawler v0|Snapbot/1\.0|Snapbot/1\.0 \(Snap Shots&#44; \+http\://www\.snap\.com\)|sogou develop spider|Sogou Orion spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sogou spider|Sogou web spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sohu agent|SSurf15a 11 |TSurf15a 11|Under the Rainbow 2\.2|User-Agent\: Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1\)|VadixBot|WebVulnCrawl\.unknown/1\.0 libwww-perl/5\.803|Wells Search II|WEP Search 00
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*"(?:%(badbots)s|%(badbotscustom)s)"$
^<HOST> .*"GET /(aa/|ss/|rr/|ig/|in/|be/|go/|sidwso\.php|filefuns\.php|gel4y\.php|\.tmb/admin.php|access\.php|wp-admin/includes/xmrlpc\.php|\.well-known/pki-validation/cloud\.php|inicio-sesion\.php|admin-post\.php|notip\.html|images/pt_logo\.svg|images/process\.jpg|pl/payu/pay\.php|san_filez/img/alert\.svg|files/img/blank\.gif|merchantbank/pageBank/bank).*" 404.*
ignoreregex =
datepattern = ^[^\[]*\[({DATE})
{^LN-BEG}
Can you confirm that the failregex will take both scenarios into account?

Also, is it useful to keep datepattern?
 
I can't add both rules to the jail filter:
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*"(?:%(badbots)s|%(badbotscustom)s)"$
^<HOST> .*"GET /(aa/|ss/|rr/|ig/|in/|be/|go/|sidwso\.php|filefuns\.php|gel4y\.php|\.tmb/admin.php|access\.php|wp-admin/includes/xmrlpc\.php|\.well-known/pki-validation/cloud\.php|inicio-sesion\.php|admin-post\.php|notip\.html|images/pt_logo\.svg|images/process\.jpg|pl/payu/pay\.php|san_filez/img/alert\.svg|files/img/blank\.gif|merchantbank/pageBank/bank).*" 404.*

but I get the following error:
Error: Unable to register jail filter :
f2bmng failed: f2bmng.py:382: DeprecationWarning: This method will be removed in future versions. Use 'parser.read_file()' instead.
ERROR:__main__:Source contains parsing errors: '<stdin>'
[line 5]: '^<HOST> .*"GET /(aa/|ss/|rr/|ig/|in/|be/|go/|sidwso\\.php|filefuns\\.php|gel4y\\.php|\\.tmb/admin.php|access\\.php|wp-admin/includes/xmrlpc\\.php|\\.well-known/pki-validation/cloud\\.php|inicio-sesion\\.php|admin-post\\.php|notip\\.html|images/pt_logo\\.svg|images/process\\.jpg|pl/payu/pay\\.php|san_filez/img/alert\\.svg|files/img/blank\\.gif|merchantbank/pageBank/bank).*" 404.*\r\n'.
 
I cannot see why it doesn't work for you. You could try testing it first with fail2ban-regex; it might give better clues.
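For example (the paths are just examples; on Plesk the per-domain logs usually live under /var/www/vhosts/system/<domain>/logs/):

Code:
# Run the filter against a real access log and check which lines it matches
fail2ban-regex /var/www/vhosts/system/example.com/logs/access_ssl_log \
               /etc/fail2ban/filter.d/apache-badbots.conf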
 
Okay, I'll look into it.

Is it useful to keep datepattern = ^[^\[]*\[({DATE}) {^LN-BEG} for the Fail2ban apache-badbots filter?
 
Thanks for your reply!

I finally managed to add several conditions to the failregex; I had to indent the additional lines with a few spaces so that they would be validated properly.
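For reference, the parser only accepts the additional failregex lines when they are indented, roughly like this (the second pattern is shortened here for readability):

Code:
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*"(?:%(badbots)s|%(badbotscustom)s)"$
            ^<HOST> .*"GET /(aa/|ss/|rr/|sidwso\.php|filefuns\.php).*" 404.*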
 
Robots are indeed blocked with Fail2ban, but is there a way to limit crawling by bots that do not declare themselves as such (e.g. Turnitinbot) and that visit 10 pages per second with code 200 only?
 
Yes of course, but to prevent further attacks of this type, isn't there a way of limiting the number of simultaneous connections per IP?
 
The problem lies in what counts as an "established" connection. Once the same source IP, target IP, and port have successfully connected (and passed the iptables rule that can rate-limit connections), the connection is considered "established", so all subsequent requests will pass as well. Rate limiting based on iptables would therefore only work well if each of your websites had a different IP address. Do they?
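To illustrate what iptables can do here: a connlimit rule like the one below caps concurrent connections per source IP, but it counts TCP connections, not HTTP requests sent over an open keep-alive connection (the numbers are just examples):

Code:
# Reject new HTTPS connections from a source IP that already has 20 open ones
iptables -I INPUT -p tcp --syn --dport 443 \
  -m connlimit --connlimit-above 20 --connlimit-mask 32 \
  -j REJECT --reject-with tcp-reset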
 
As I understand it, your bot does not have a signature (no name) and it only crawls existing pages, correct? Are you sure that all the 200 responses are truly existing pages, or might a rewrite rule in your .htaccess file respond to everything with code 200?
 
That's right.
The robot pretends to be a regular visitor. It crawls pages that actually exist by following internal links, with code 200.
 
If the pages really exist, you can only manually ban the IP, e.g.
# fail2ban-client set recidive banip <ip address goes here>

But if you find a file that responds with a code 200 but is actually not a valid page of the site, you could use that to detect the bot and ban it automatically.
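For example, with a made-up trap path like /bot-trap/ (a page linked nowhere a human would click and disallowed in robots.txt), a filter line such as the following would catch it:

Code:
# /bot-trap/ is a hypothetical decoy path; only automated crawlers should hit it
failregex = ^<HOST> .*"GET /bot-trap/.*" 200.*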
 
You can try to block the traffic with the methods explained in How to Avoid High CPU Load & Block Bad Bots with Plesk. I recently added some extra rules not mentioned in the article that resulted from analysis of ongoing bad bot visits

Although "Bytespider" is in the badbots list, it crawled my site for most of the night. My settings are as follows. Is there something wrong with them?

[Definition]
badbotscustom = thesis-research-bot
badbots = GPTBot|AmazonBot|Bytespider|Bytedance|fidget-spinner-bot|EmailCollector|WebEMailExtrac|TrackBack/1\.02|sogou music spider|seocompany|LieBaoFast|SEOkicks|Uptimebot|Cliqzbot|ssearch_bot|domaincrawler|AhrefsBot|spot|DigExt|Sogou|MegaIndex\.ru|majestic12|80legs|SISTRIX|HTTrack|Semrush|MJ12|Ezooms|CCBot|TalkTalk|Ahrefs|BLEXBot|Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|bwh3_user_agent|China Local Browse 2\.6|ContactBot/0\.2|ContentSmartz|DataCha0s/2\.0|DBrowse 1\.4b|DBrowse 1\.4d|Demo Bot DOT 16b|Demo Bot Z 16b|DSurf15a 01|DSurf15a 71|DSurf15a 81|DSurf15a VA|EBrowse 1\.4b|Educate Search VxB|EmailSiphon|EmailSpider|EmailWolf 1\.00|ESurf15a 15|ExtractorPro|Franklin Locator 1\.8|FSurf15a 01|Full Web Bot 0416B|Full Web Bot 0516B|Full Web Bot 2816B|Guestbook Auto Submitter|Industry Program 1\.0\.x|ISC Systems iRc Search 2\.1|IUPUI Research Bot v 1\.9a|LARBIN-EXPERIMENTAL \(efp@gmx\.net\)|LetsCrawl\.com/1\.0 \+http\://letscrawl\.com/|Lincoln State Web Browser|LMQueueBot/0\.2|LWP\:\:Simple/5\.803|Mac Finder 1\.0\.xx|MFC Foundation Class Library 4\.0|Microsoft URL Control - 6\.00\.8xxx|Missauga Locate 1\.0\.0|Missigua Locator 1\.9|Missouri College Browse|Mizzu Labs 2\.2|Mo College 1\.9|MVAClient|Mozilla/2\.0 \(compatible; NEWT ActiveX; Win32\)|Mozilla/3\.0 \(compatible; Indy Library\)|Mozilla/3\.0 \(compatible; scan4mail \(advanced version\) http\://www\.peterspages\.net/?scan4mail\)|Mozilla/4\.0 \(compatible; Advanced Email Extractor v2\.xx\)|Mozilla/4\.0 \(compatible; Iplexx Spider/1\.0 http\://www\.iplexx\.at\)|Mozilla/4\.0 \(compatible; MSIE 5\.0; Windows NT; DigExt; DTS Agent|Mozilla/4\.0 efp@gmx\.net|Mozilla/5\.0 \(Version\: xxxx Type\:xx\)|NameOfAgent \(CMS Spider\)|NASA Search 1\.0|Nsauditor/1\.x|PBrowse 1\.4b|PEval 1\.4b|Poirot|Port Huron Labs|Production Bot 0116B|Production Bot 2016B|Production Bot DOT 3016B|Program Shareware 1\.0\.2|PSurf15a 11|PSurf15a 51|PSurf15a VA|psycheclone|RSurf15a 41|RSurf15a 51|RSurf15a 81|searchbot admin@google\.com|ShablastBot 1\.0|snap\.com beta crawler v0|Snapbot/1\.0|Snapbot/1\.0 \(Snap Shots&#44; \+http\://www\.snap\.com\)|sogou develop spider|Sogou Orion spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sogou spider|Sogou web spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sohu agent|SSurf15a 11 |TSurf15a 11|Turnitinbot|Under the Rainbow 2\.2|User-Agent\: Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1\)|VadixBot|WebVulnCrawl\.unknown/1\.0 libwww-perl/5\.803|Wells Search II|WEP Search 00
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*"(?:%(badbots)s|%(badbotscustom)s)"$
^<HOST> .*"GET /(aa/|ss/|rr/|ig/|in/|be/|go/|sidwso\.php|filefuns\.php|gel4y\.php|\.tmb/admin.php|access\.php|wp-admin/includes/xmrlpc\.php|xmlrpc\.php|application/xhtml+xml|\.well-known/pki-validation/cloud\.php|inicio-sesion\.php|admin-post\.php|notip\.html|images/pt_logo\.svg|images/process\.jpg|pl/payu/pay\.php|san_filez/img/alert\.svg|files/img/blank\.gif|wordpress|merchantbank/pageBank/bank).*" 404.*
^<HOST> .*HEAD .*wordpress*" 404.*
^<HOST> .*GET .*aws(/|_|-)(credentials|secrets|keys).*
^<HOST> .*GET .*credentials/aws.*
^<HOST> .*GET .*secrets/(aws|keys).*
^<HOST> .*GET .*oauth/config.*
^<HOST> .*GET .*config/oauth.*
^<HOST> .*GET .*(travis-scripts|tests-legacy)/.*
^<HOST> .*"GET .*(freshio|woocommerce).*frontend.*" (301|404).*
^<HOST> .*"GET .*contact-form-7/includes.*" (301|404).*
^<HOST> .*"(GET|POST) .*author=.*" 404.*
^<HOST> .*"GET .*.prototype.*" (301|404).*
^<HOST> .*"GET .*application/*" (301|404).*
ignoreregex =
 
Absolutely, it works now. Thanks @pleskpanel!

If you have the time, don't hesitate to include the modification in your article How to Avoid High CPU Load & Block Bad Bots with Plesk, which will also help other users.
I'm afraid I won't be able to edit it. The article is meant to jump-start you and others on the topic. I am using similar settings on my servers here, but by now with a lot of changes, as attacks constantly evolve.

But what you could do is to add the suggested modification as a comment to the article.
 