Issue Googlebot blocked and uptime monitor traffic blocked too...

LionKing · Mar 7, 2023

Hi guys.

OK, so I have been looking into this for some time, and it seems that Plesk blocks traffic from Googlebot and other indexing bots requesting pages (probably also Bing, which Yahoo and DuckDuckGo also use):

Yandex too:

The same happens for the up-time monitor service we use from WPMU Dev:

It is a WordPress Multisite (our corporate website), and when we ran the old server with cPanel this issue did not exist, which leads me to think that Plesk is the obvious culprit.

A quick sneak peek at WPMU Dev's uptime monitoring logs reveals this, which might be a clue (or not).

Unfortunately, they do not allow changing the request method, but it would probably not make a difference, since we already know that legitimate indexing bots are also being blocked.

Any ideas?

Thanks in advance.

Kind regards
LionKing

Peter Debik · Mar 8, 2023

There can be a number of settings that block bots. For example, if the WP Toolkit security option to block bad bots is activated, this will lock out some. There could also be an entry in your .htaccess file to block bots. ModSecurity blocks also result in 403 errors; in that case, however, you can see an entry mentioning "ModSecurity" in error_log at the given time.
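As a rough sketch of the error_log check described above, the following Python snippet filters log lines that mention ModSecurity and can optionally narrow them to one client IP. The sample lines and the `[client ...]` layout are assumptions for illustration only; the real format depends on the Apache and ModSecurity configuration, and `modsecurity_hits` is a hypothetical helper name.

```python
import re

# Assumed Apache 2.4 style client field, e.g. "[client 203.0.113.7:54321]".
CLIENT_RE = re.compile(r"\[client (\d{1,3}(?:\.\d{1,3}){3})(?::\d+)?\]")


def modsecurity_hits(lines, client_ip=None):
    """Return error_log lines that mention ModSecurity.

    If client_ip is given, keep only lines whose [client ...] field matches it.
    """
    hits = []
    for line in lines:
        if "ModSecurity" not in line:
            continue
        if client_ip is not None:
            match = CLIENT_RE.search(line)
            if not match or match.group(1) != client_ip:
                continue
        hits.append(line.rstrip("\n"))
    return hits


# Hypothetical sample lines, not taken from the server discussed in this thread.
SAMPLE = [
    '[Tue Mar 07 10:15:02.123456 2023] [security2:error] [pid 1234] '
    '[client 203.0.113.7:54321] ModSecurity: Access denied with code 403 (phase 2). [uri "/"]',
    '[Tue Mar 07 10:15:03.000000 2023] [core:error] [pid 1234] '
    '[client 203.0.113.7:54322] AH00126: Invalid URI in request',
    '[Tue Mar 07 10:16:00.000000 2023] [security2:error] [pid 1235] '
    '[client 198.51.100.9:40000] ModSecurity: Warning. Pattern match [uri "/wp-login.php"]',
]

if __name__ == "__main__":
    for hit in modsecurity_hits(SAMPLE, client_ip="203.0.113.7"):
        print(hit)
```

To run it against a real log, pass an open file object instead of `SAMPLE`. If the 403 responses seen by the bots have no matching ModSecurity entry at the same timestamp, the block is probably coming from somewhere else, such as an .htaccess rule.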

LionKing · Mar 8, 2023

Peter Debik said:
WP Toolkit security option to block bad bots is activated, this will lock out some

Thanks, Peter. Well, we do not use WP Toolkit because it cannot figure out our setup, which is quite different. Plus, we already use services that do the same.

Peter Debik said:
There could also be an entry in your .htaccess file to block bots

As for .htaccess, you might be correct, and I suppose we need to dig through it and see if there is any rule that might be blocking something.

Peter Debik said:
ModSecurity blocks also result in 403 errors; in that case, however, you can see an entry mentioning "ModSecurity" in error_log at the given time.

Thanks, we will look in the logs too, to see if there might be something that puts us on the right track.

Kind regards

mow · Mar 8, 2023

LionKing said:
OK, so I have been looking into this for some time, and it seems that Plesk blocks traffic from Googlebot and other indexing bots requesting pages (probably also Bing, which Yahoo and DuckDuckGo also use):

That's strange. I've never seen any of those bots use HTTP/1.0. Do you have a stupid loadbalancer in front of your server?
That can't work with modern (v)hosting because HTTP/1.0 is pre-SNI. All such requests would be served by the default server for the IP, same as if you'd directly supply the IP instead of a domain name in the url's host part. Did you assign IPs exclusively to domains (ending up in /etc/nginx/plesk.conf.d/ip_default/)?
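To illustrate the fallback described above: HTTP/1.0 does not require a Host header, so a request that arrives without one gives a name-based virtual host setup nothing to match on. The snippet below is a toy model under that assumption, not nginx's or Plesk's actual selection logic; `parse_host`, `select_vhost`, and the `VHOSTS` table are hypothetical names.

```python
def parse_host(raw_request: bytes):
    """Return the Host header value from a raw HTTP request, or None if absent."""
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    for line in head.split("\r\n")[1:]:  # skip the request line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "host":
            return value.strip().lower() or None
    return None


def select_vhost(raw_request: bytes, vhosts: dict, default: str) -> str:
    """Toy name-based lookup: use the default site when no known Host is sent."""
    host = parse_host(raw_request)
    if host is None:
        return default
    host = host.split(":", 1)[0]  # drop an optional port
    return vhosts.get(host, default)


# Hypothetical sites sharing one IP address.
VHOSTS = {"example.com": "site-a", "shop.example.com": "site-b"}

HTTP10_NO_HOST = b"GET /robots.txt HTTP/1.0\r\n\r\n"
HTTP11_WITH_HOST = b"GET /robots.txt HTTP/1.1\r\nHost: example.com\r\n\r\n"
HTTP11_BARE_IP = b"GET / HTTP/1.1\r\nHost: 192.0.2.10\r\n\r\n"

if __name__ == "__main__":
    for req in (HTTP10_NO_HOST, HTTP11_WITH_HOST, HTTP11_BARE_IP):
        print(req.split(b"\r\n", 1)[0].decode(), "->", select_vhost(req, VHOSTS, "default-site"))
```

In this model, both the HTTP/1.0 request without a Host header and the request addressed to the bare IP end up at the default site, which is the behaviour described in the post.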

LionKing · Mar 8, 2023

mow said:
That's strange. I've never seen any of those bots use HTTP/1.0. Do you have a stupid loadbalancer in front of your server?
That can't work with modern (v)hosting because HTTP/1.0 is pre-SNI. All such requests would be served by the default server for the IP, same as if you'd directly supply the IP instead of a domain name in the url's host part.

Cloudflare's infrastructure is maybe doing stupid things (?).
We use them as a security layer and, obviously, for caching. So it is not really our servers that send the initial response, although the logs provided above were logged by our server(s).

mow said:
Did you assign IPs exclusively to domains (ending up in /etc/nginx/plesk.conf.d/ip_default/)?

No, we have chosen to use just one fixed IP for this server. So all our business apps, systems and the company website share the same IP.

Kind regards

mow · Mar 8, 2023

LionKing said:
No, we have chosen to use just one fixed IP for this server. So all our business apps, systems and the company website share the same IP.

And what site did you set as default? (the site you get when you access the IP)

LionKing · Mar 8, 2023

Thanks for the reply Mow.

mow said:
And what site did you set as default? (the site you get when you access the IP)

None.
1.) You need to know the URL/domain name to access our systems on our server. If you misspell it, you will either be served a 404 (if the domain is correct) or, in the case of NXDOMAIN and/or a typo in the domain, you will be served the default landing page, which is the same page I describe below for requests made with just the IP address.

2.) If you just enter the IP address itself in a browser and hit Enter, you will see the default "splash screen/landing page" of Plesk.
(Although we customized it, because we think it is unnecessary to announce to the world that our servers are running Plesk.)

Kind regards

mow · Mar 9, 2023

LionKing said:
1.) You need to know the URL/domain name to access our systems on our server. If you misspell it, you will either be served a 404 (if the domain is correct) or, in the case of NXDOMAIN and/or a typo in the domain, you will be served the default landing page, which is the same page I describe below for requests made with just the IP address.

Then, that default landing page is also what you (or the bots) get with HTTP/1.0.

LionKing · Mar 9, 2023

mow said:
Then, that default landing page is also what you (or the bots) get with HTTP/1.0.

Well, yes, for all requests that do not concern the installed web apps.
The above logs are for the company's corporate website: https//:takemarket.co.uk so that is not what you are seeing (HTTP/1.0 / default page).

Hmm... With that said, it does say "HTTP/1.0". Could we be seeing this because we do not allow any insecure requests, i.e. everything must go over the encrypted "https" protocol?

mow · Mar 9, 2023

LionKing said:
Well, yes, for all requests that do not concern the installed web apps.
The above logs are for the company's corporate website: https//:takemarket.co.uk so that is not what you are seeing (HTTP/1.0 / default page).

Well they would try to access subpages under that default page which do not exist, and get a 403.
I have no idea why HEAD on / throws a 405, though.

LionKing said:
Hmm... With that said, it does say "HTTP/1.0". Could we be seeing this because we do not allow any insecure requests, i.e. everything must go over the encrypted "https" protocol?

No, the HTTP protocol version is independent from SSL.
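One way to check which protocol versions and status codes actually reach the server is to tally the access log by method, protocol, and status. The sketch below assumes the common "combined" access log layout; the sample lines, IP addresses, and user agents are hypothetical and only mirror the 403 and 405 responses mentioned in this thread. Note that the request line carries the HTTP version but says nothing about whether TLS was used.

```python
import re
from collections import Counter

# Assumed "combined" log layout; adjust the pattern if your log format differs.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<proto>HTTP/\d(?:\.\d)?)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def tally(lines):
    """Count requests per (method, protocol, status); unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            counts[(m["method"], m["proto"], int(m["status"]))] += 1
    return counts


# Hypothetical sample lines using documentation-only IP ranges.
SAMPLE_LOG = [
    '192.0.2.1 - - [07/Mar/2023:10:15:02 +0000] "GET /about/ HTTP/1.0" 403 199 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [07/Mar/2023:10:16:00 +0000] "HEAD / HTTP/1.0" 405 0 '
    '"-" "hypothetical-uptime-monitor"',
    '198.51.100.9 - - [07/Mar/2023:10:17:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for (method, proto, status), n in sorted(tally(SAMPLE_LOG).items()):
        print(f"{method:5} {proto:9} {status} x{n}")
```

If every blocked request shows HTTP/1.0 while normal visitor traffic shows HTTP/1.1 or newer, that would point at whatever sits between the client and the logging web server rather than at the bots themselves.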

LionKing · Mar 9, 2023

Interesting.
Thanks a bunch for the feedback. Sorry for the misspelled link, by the way; here is the correctly formatted link: takemarket.co.uk
The conundrum still remains, though. I guess we (my colleagues and I) just need to keep digging, and hopefully we will find something sooner or later.

Kind regards
