Question Rate Limiting

John41 · Mar 27, 2024

Hello,

Yesterday, several robots crawled my site with codes 200 and 301, overloading my server. The same IP sent a dozen requests to many different pages in the same second, for over an hour.

Is it possible in Plesk to limit this number of connections per second, as in the case of Rate Limiting?

Thank you!

Peter Debik · Mar 27, 2024

I've seen an incredible increase of such robot visites especially from Amazon AWS instances recently. It went up about 12 times of the usual traffic, measured across a dozen machines, so probably it's a general issue as it does not target specific IPs or domains. You can try to block the traffic with the methods explained in How to Avoid High CPU Load & Block Bad Bots with Plesk. I recently added some extra rules not mentioned in the article that resulted from analysis of ongoing bad bot visits:

Code:

^<HOST> .*"GET /(aa/|ss/|rr/|ig/|in/|be/|go/|sidwso\.php|filefuns\.php|gel4y\.php|\.tmb/admin.php|access\.php|wp-admin/includes/xmrlpc\.php|\.well-known/pki-validation/cloud\.php|inicio-sesion\.php|admin-post\.php|notip\.html|images/pt_logo\.svg|images/process\.jpg|pl/payu/pay\.php|san_filez/img/alert\.svg|files/img/blank\.gif|merchantbank/pageBank/bank).*" 404.*

Besides that, you can also check whether the visits are coming from the same subnet and if these amount to hundreds or thousands block the whole subnet by a suitable rule in ip tables. For example when you observe a very high number of bans by Fail2Ban for a specific subnet like 47.128.... you could
iptables -I INPUT 1 -s 47.128.0.0/16 -j DROP
But make sure, nothing important like Let's Encrypt servers or other stuff your system needs to contact is located in such a subnet.

Then you could take a look at mod_evasive, an Apache module that can block excessive traffic from the same source. I do not recommend it, though, because it comes with issues.

John41 · Mar 27, 2024

Thank you for your reply.

I have read your article and have modified the Fail2ban apache-badbots filter with the following information, having taken your article and the clarification indicated in your previous message:

[Definition]
badbotscustom = thesis-research-bot
badbots = GPTBot|AmazonBot|Bytespider|Bytedance|fidget-spinner-bot|EmailCollector|WebEMailExtrac|TrackBack/1\.02|sogou music spider|seocompany|LieBaoFast|SEOkicks|Uptimebot|Cliqzbot|ssearch_bot|domaincrawler|AhrefsBot|spot|DigExt|Sogou|MegaIndex\.ru|majestic12|80legs|SISTRIX|HTTrack|Semrush|MJ12|Ezooms|CCBot|TalkTalk|Ahrefs|BLEXBot|Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|bwh3_user_agent|China Local Browse 2\.6|ContactBot/0\.2|ContentSmartz|DataCha0s/2\.0|DBrowse 1\.4b|DBrowse 1\.4d|Demo Bot DOT 16b|Demo Bot Z 16b|DSurf15a 01|DSurf15a 71|DSurf15a 81|DSurf15a VA|EBrowse 1\.4b|Educate Search VxB|EmailSiphon|EmailSpider|EmailWolf 1\.00|ESurf15a 15|ExtractorPro|Franklin Locator 1\.8|FSurf15a 01|Full Web Bot 0416B|Full Web Bot 0516B|Full Web Bot 2816B|Guestbook Auto Submitter|Industry Program 1\.0\.x|ISC Systems iRc Search 2\.1|IUPUI Research Bot v 1\.9a|LARBIN-EXPERIMENTAL $efp@gmx\.net$|LetsCrawl\.com/1\.0 \+http\://letscrawl\.com/|Lincoln State Web Browser|LMQueueBot/0\.2|LWP\:\:Simple/5\.803|Mac Finder 1\.0\.xx|MFC Foundation Class Library 4\.0|Microsoft URL Control - 6\.00\.8xxx|Missauga Locate 1\.0\.0|Missigua Locator 1\.9|Missouri College Browse|Mizzu Labs 2\.2|Mo College 1\.9|MVAClient|Mozilla/2\.0 $compatible; NEWT ActiveX; Win32$|Mozilla/3\.0 $compatible; Indy Library$|Mozilla/3\.0 $compatible; scan4mail \(advanced version$ http\://www\.peterspages\.net/?scan4mail\)|Mozilla/4\.0 $compatible; Advanced Email Extractor v2\.xx$|Mozilla/4\.0 $compatible; Iplexx Spider/1\.0 http\://www\.iplexx\.at$|Mozilla/4\.0 $compatible; MSIE 5\.0; Windows NT; DigExt; DTS Agent|Mozilla/4\.0 efp@gmx\.net|Mozilla/5\.0 \(Version\: xxxx Type\:xx$|NameOfAgent $CMS Spider$|NASA Search 1\.0|Nsauditor/1\.x|PBrowse 1\.4b|PEval 1\.4b|Poirot|Port Huron Labs|Production Bot 0116B|Production Bot 2016B|Production Bot DOT 3016B|Program Shareware 1\.0\.2|PSurf15a 11|PSurf15a 51|PSurf15a VA|psycheclone|RSurf15a 41|RSurf15a 51|RSurf15a 81|searchbot admin@google\.com|ShablastBot 1\.0|snap\.com beta crawler v0|Snapbot/1\.0|Snapbot/1\.0 $Snap Shots, \+http\://www\.snap\.com$|sogou develop spider|Sogou Orion spider/3\.0$\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07$|sogou spider|Sogou web spider/3\.0$\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07$|sohu agent|SSurf15a 11 |TSurf15a 11|Under the Rainbow 2\.2|User-Agent\: Mozilla/4\.0 $compatible; MSIE 6\.0; Windows NT 5\.1$|VadixBot|WebVulnCrawl\.unknown/1\.0 libwww-perl/5\.803|Wells Search II|WEP Search 00
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*"(?:%(badbots)s|%(badbotscustom)s)"$
^<HOST> .*"GET /(aa/|ss/|rr/|ig/|in/|be/|go/|sidwso\.php|filefuns\.php|gel4y\.php|\.tmb/admin.php|access\.php|wp-admin/includes/xmrlpc\.php|\.well-known/pki-validation/cloud\.php|inicio-sesion\.php|admin-post\.php|notip\.html|images/pt_logo\.svg|images/process\.jpg|pl/payu/pay\.php|san_filez/img/alert\.svg|files/img/blank\.gif|merchantbank/pageBank/bank).*" 404.*
ignoreregex =
datepattern = ^[^\[]*\[({DATE})
{^LN-BEG}

Can you confirm that the failregex parameters will take into account the 2 scenarios?

Also, is it useful to keep datepattern ?

John41 · Mar 27, 2024

I can't add both rules to the jail filter :

failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*"(?:%(badbots)s|%(badbotscustom)s)"$
^<HOST> .*"GET /(aa/|ss/|rr/|ig/|in/|be/|go/|sidwso\.php|filefuns\.php|gel4y\.php|\.tmb/admin.php|access\.php|wp-admin/includes/xmrlpc\.php|\.well-known/pki-validation/cloud\.php|inicio-sesion\.php|admin-post\.php|notip\.html|images/pt_logo\.svg|images/process\.jpg|pl/payu/pay\.php|san_filez/img/alert\.svg|files/img/blank\.gif|merchantbank/pageBank/bank).*" 404.*

but I get the following error :

Error: Unable to register jail filter :
f2bmng failed: f2bmng.py:382: DeprecationWarning: This method will be removed in future versions. Use 'parser.read_file()' instead.
ERROR:__main__:Source contains parsing errors: '<stdin>'
[line 5]: '^<HOST> .*"GET /(aa/|ss/|rr/|ig/|in/|be/|go/|sidwso\\.php|filefuns\\.php|gel4y\\.php|\\.tmb/admin.php|access\\.php|wp-admin/includes/xmrlpc\\.php|\\.well-known/pki-validation/cloud\\.php|inicio-sesion\\.php|admin-post\\.php|notip\\.html|images/pt_logo\\.svg|images/process\\.jpg|pl/payu/pay\\.php|san_filez/img/alert\\.svg|files/img/blank\\.gif|merchantbank/pageBank/bank).*" 404.*\r\n'.

Peter Debik · Mar 27, 2024

I cannot see why it doesn't work for you. You could try to first test it with fail2ban-regex. It might give better clues.

John41 · Mar 27, 2024

Okay, I'll look into it.

Is it useful to keep datepattern = ^[^\[]*\[({DATE}) {^LN-BEG} for Fail2ban apache-badbots ?

Peter Debik · Mar 27, 2024

No, it is not required if your log files have a default format.

John41 · Mar 27, 2024

Thanks for your reply!

I finally managed to put several conditions to the failregex, I had to add a few spaces between the settings so that they would be properly validated.

John41 · Mar 28, 2024

Robots are indeed blocked with Fail2ban, but is there a way to limit crawls by robots that are not declared as such (such as Turnitinbot) and that visit 10 pages per second with codes 200 only?

Peter Debik · Mar 28, 2024

You could add the Name Turnitinbot to the robots list of the file.

John41 · Mar 28, 2024

Yes of course, but to prevent further attacks of this type, isn't there a way of limiting the number of simultaneous connections per IP?

Peter Debik · Mar 28, 2024

The problem lies in what's an "established" connection. Once the same source IP, target IP and port have successfully connected (and passed the iptables rule that can rate-limite connections), they are considered "established" so that all consecutive requests will also pass. So rate-limiting based on iptables would only work well, if each website of yours has a different ip address. Do they?

John41 · Mar 28, 2024

I think so, but the problem arises with a single IP making thousands of connections on single site.

Peter Debik · Mar 29, 2024

As I understand it, your bot does not have a signature (no name) and it only crawls existing pages, correct? Are you sure that all the 200 responses are truly existing pages or might a rewrite rule in your .htaccess file respond to everything with code 200?

John41 · Mar 29, 2024

That's right.
The robot pretends to be a regular visitor. It crawls pages that actually exist by following internal links, with code 200.

Peter Debik · Mar 29, 2024

If the pages really exist, you can only manually ban the IP, e.g.
# fail2ban-client set recidive banip <ip address goes here>

But if you find a file that responds with a code 200 but is actually not a valid page of the site, you could use that to detect the bot and ban it automatically.

John41 · Mar 30, 2024

Peter Debik said:
You can try to block the traffic with the methods explained in How to Avoid High CPU Load & Block Bad Bots with Plesk. I recently added some extra rules not mentioned in the article that resulted from analysis of ongoing bad bot visits

Despite the fact that "Bytespider" is in the badbot list, it crawled my site for most of the night. My settings are as follows. Is there a wrong setting?

[Definition]
badbotscustom = thesis-research-bot
badbots = GPTBot|AmazonBot|Bytespider|Bytedance|fidget-spinner-bot|EmailCollector|WebEMailExtrac|TrackBack/1\.02|sogou music spider|seocompany|LieBaoFast|SEOkicks|Uptimebot|Cliqzbot|ssearch_bot|domaincrawler|AhrefsBot|spot|DigExt|Sogou|MegaIndex\.ru|majestic12|80legs|SISTRIX|HTTrack|Semrush|MJ12|Ezooms|CCBot|TalkTalk|Ahrefs|BLEXBot|Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|bwh3_user_agent|China Local Browse 2\.6|ContactBot/0\.2|ContentSmartz|DataCha0s/2\.0|DBrowse 1\.4b|DBrowse 1\.4d|Demo Bot DOT 16b|Demo Bot Z 16b|DSurf15a 01|DSurf15a 71|DSurf15a 81|DSurf15a VA|EBrowse 1\.4b|Educate Search VxB|EmailSiphon|EmailSpider|EmailWolf 1\.00|ESurf15a 15|ExtractorPro|Franklin Locator 1\.8|FSurf15a 01|Full Web Bot 0416B|Full Web Bot 0516B|Full Web Bot 2816B|Guestbook Auto Submitter|Industry Program 1\.0\.x|ISC Systems iRc Search 2\.1|IUPUI Research Bot v 1\.9a|LARBIN-EXPERIMENTAL $efp@gmx\.net$|LetsCrawl\.com/1\.0 \+http\://letscrawl\.com/|Lincoln State Web Browser|LMQueueBot/0\.2|LWP\:\:Simple/5\.803|Mac Finder 1\.0\.xx|MFC Foundation Class Library 4\.0|Microsoft URL Control - 6\.00\.8xxx|Missauga Locate 1\.0\.0|Missigua Locator 1\.9|Missouri College Browse|Mizzu Labs 2\.2|Mo College 1\.9|MVAClient|Mozilla/2\.0 $compatible; NEWT ActiveX; Win32$|Mozilla/3\.0 $compatible; Indy Library$|Mozilla/3\.0 $compatible; scan4mail \(advanced version$ http\://www\.peterspages\.net/?scan4mail\)|Mozilla/4\.0 $compatible; Advanced Email Extractor v2\.xx$|Mozilla/4\.0 $compatible; Iplexx Spider/1\.0 http\://www\.iplexx\.at$|Mozilla/4\.0 $compatible; MSIE 5\.0; Windows NT; DigExt; DTS Agent|Mozilla/4\.0 efp@gmx\.net|Mozilla/5\.0 \(Version\: xxxx Type\:xx$|NameOfAgent $CMS Spider$|NASA Search 1\.0|Nsauditor/1\.x|PBrowse 1\.4b|PEval 1\.4b|Poirot|Port Huron Labs|Production Bot 0116B|Production Bot 2016B|Production Bot DOT 3016B|Program Shareware 1\.0\.2|PSurf15a 11|PSurf15a 51|PSurf15a VA|psycheclone|RSurf15a 41|RSurf15a 51|RSurf15a 81|searchbot admin@google\.com|ShablastBot 1\.0|snap\.com beta crawler v0|Snapbot/1\.0|Snapbot/1\.0 $Snap Shots, \+http\://www\.snap\.com$|sogou develop spider|Sogou Orion spider/3\.0$\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07$|sogou spider|Sogou web spider/3\.0$\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07$|sohu agent|SSurf15a 11 |TSurf15a 11|Turnitinbot|Under the Rainbow 2\.2|User-Agent\: Mozilla/4\.0 $compatible; MSIE 6\.0; Windows NT 5\.1$|VadixBot|WebVulnCrawl\.unknown/1\.0 libwww-perl/5\.803|Wells Search II|WEP Search 00
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*"(?:%(badbots)s|%(badbotscustom)s)"$
^<HOST> .*"GET /(aa/|ss/|rr/|ig/|in/|be/|go/|sidwso\.php|filefuns\.php|gel4y\.php|\.tmb/admin.php|access\.php|wp-admin/includes/xmrlpc\.php|xmlrpc\.php|application/xhtml+xml|\.well-known/pki-validation/cloud\.php|inicio-sesion\.php|admin-post\.php|notip\.html|images/pt_logo\.svg|images/process\.jpg|pl/payu/pay\.php|san_filez/img/alert\.svg|files/img/blank\.gif|wordpress|merchantbank/pageBank/bank).*" 404.*
^<HOST> .*HEAD .*wordpress*" 404.*
^<HOST> .*GET .*aws(/|_|-)(credentials|secrets|keys).*
^<HOST> .*GET .*credentials/aws.*
^<HOST> .*GET .*secrets/(aws|keys).*
^<HOST> .*GET .*oauth/config.*
^<HOST> .*GET .*config/oauth.*
^<HOST> .*GET .*(travis-scripts|tests-legacy)/.*
^<HOST> .*"GET .*(freshio|woocommerce).*frontend.*" (301|404).*
^<HOST> .*"GET .*contact-form-7/includes.*" (301|404).*
^<HOST> .*"(GET|POST) .*author=.*" 404.*
^<HOST> .*"GET .*.prototype.*" (301|404).*
^<HOST> .*"GET .*application/*" (301|404).*
ignoreregex =

Peter Debik · Mar 30, 2024

You may need @pleskpanel 's solution shown in Issue - Default plesk-apache-badbot fail2ban doesn't work

John41 · Mar 30, 2024

Absolutely it works now. Thanks @pleskpanel !

If you have the time, don't hesitate to include the modification in your article How to Avoid High CPU Load & Block Bad Bots with Plesk, which will also help other users.

Peter Debik · Mar 30, 2024

John41 said:
Absolutely it works now. Thanks @pleskpanel !

If you have the time, don't hesitate to include the modification in your article How to Avoid High CPU Load & Block Bad Bots with Plesk, which will also help other users.

I'm afraid I won't be able to edit this. The article is meant to jump start you and others on the topic. I am using similar settings on my servers here, but meanwhile already with a lot of changes as attacks constantly evolve.

But what you could do is to add the suggested modification as a comment to the article.

Question Rate Limiting

New Pleskian

Community Manager until 3/2024

New Pleskian

New Pleskian

Community Manager until 3/2024

New Pleskian

Community Manager until 3/2024

New Pleskian

New Pleskian

Community Manager until 3/2024

New Pleskian

Community Manager until 3/2024

New Pleskian

Community Manager until 3/2024

New Pleskian

Community Manager until 3/2024

New Pleskian

Community Manager until 3/2024

New Pleskian

Community Manager until 3/2024

Similar threads