Issue Need help » Plesk + CentOS + WP = PHP-FPM CPU @ 100%

mow · Aug 18, 2022

Does this happen when you get hit by search engine crawlers?
We finally got rid of such problems when we added varnish, as moving the affected site to a separate server without plesk didn't help much.

Cike76 · Aug 21, 2022

mow said:
Does this happen when you get hit by search engine crawlers?
We finally got rid of such problems when we added varnish, as moving the affected site to a separate server without plesk didn't help much.

Yes! Crawlers sometimes put a high load on webservers.
I would recommend a WordPress cache plugin plus some client caching rules. Using Cloudflare also helps, as it works as a proxy and can cache static content.
I use WP Super Cache + Autoptimize, and that does the trick.
I also recommend Wordfence to stop hacking attempts (those also generate a high load).

magestyx · Aug 22, 2022

In our case, there have been no noticeable increases in traffic of any kind that causes these.
The new CentOS 7 servers simply had sites migrated to them from CentOS 6 servers that were fine. So the same sites, just different OS and the latest plesk - and here come random inexplicable load spikes galore.

Bitpalast · Aug 23, 2022

Load spikes can always be explained. You just need to find the right hook where to catch them.

Next time it occurs, run
# watch "ps aux | sort -nrk 3,3 | head -n 20"
as a first response to find out which processes are consuming the most CPU time.

I also recommend checking all PHP-FPM processes in such a situation. Most likely you'll find one user running many PHP-FPM children, each with a high load.
# ps aux | grep php-fpm
That user is your culprit. From there descend into the logs directory of the user's subscription and check what's going on in the error_log and access_ssl_log. You'll probably find the cause there, e.g. frequent requests from bad bots or something similar.
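
To make the "which user has many busy PHP-FPM children" check quicker to read, the ps output can be grouped per user. This is only a rough sketch: the function name fpm_by_user is made up for illustration, and it assumes the php-fpm processes show up with the command name php-fpm (the master processes will appear under root).

```shell
# Summarise php-fpm usage per system user.
# Expects lines of the form: USER %CPU %MEM COMMAND
# Prints: user, number of php-fpm processes, summed %CPU, summed %MEM,
# sorted by summed %CPU (highest first).
fpm_by_user() {
    awk '$4 ~ /php-fpm/ { n[$1]++; cpu[$1] += $2; mem[$1] += $3 }
         END { for (u in n) printf "%s %d %.1f %.1f\n", u, n[u], cpu[u], mem[u] }' |
        sort -k3,3 -nr
}

# Live usage (user:32 avoids truncating long subscription user names):
#   ps -eo user:32,pcpu,pmem,comm --no-headers | fpm_by_user
```

The user at the top of that list is the first candidate whose logs are worth reading.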

magestyx · Aug 24, 2022

Hi, Peter.

The particularly difficult part of this situation is that when the spikes happen, SSH becomes essentially nonfunctional, like the rest of the server. Since the spikes come at random times and only every 12-72 hours, it's nearly impossible for us to be watching at the right moment. The few times we have had SSH open when it happened, the session basically froze and only worked again after the load average calmed back down, as did the rest of the server. The only thing we've ever been able to catch comes from the few times we could get top to finally respond: it shows kswapd0 jumping to 100% for the duration of the spike. But no access logs or anything else show anything unusual beforehand. All our research on the kswapd0 behaviour points to changing the swappiness setting, which we've tried many, many times.

We've been running servers for 20+ years and haven't seen an issue like this. So frustrating.
One thing to note - this issue DOES seem somehow tied to WordPress sites. The only new CentOS 7 server that has never had these spikes has only Joomla sites on it. All the others host a mix of WordPress, Joomla, custom sites, etc. WordPress is by FAR the greediest and pickiest about resources, but exactly what it might be triggering is a mystery. And the CentOS 6 servers we migrated from never, ever did this. These are the same sites, moved from CentOS 6 to CentOS 7 with the latest Plesk, and we noticed the issue almost immediately (within days). Those original spikes happened 10-12 times per day and could last up to 30 minutes, so we're vastly improved now after months of countless tweaks - but we'd SO like to know what's actually causing this and stamp it out for good.

Maarten · Aug 24, 2022

Did you enable the option "Take over wp-cron.php" in the WordPress Toolkit and run it on a different schedule?

WP Toolkit summary (docs.plesk.com)

That would prevent running the wp-cron.php every time someone visits a WordPress website.
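
For WordPress installs that are not managed through WP Toolkit, roughly the same effect can be had by hand: set define('DISABLE_WP_CRON', true); in the site's wp-config.php and trigger wp-cron.php from a system cron job instead. The sketch below only prints a crontab line; the function name wp_cron_line and the example path are hypothetical, and it assumes a php binary on the PATH of the cron user.

```shell
# Print a crontab line that runs WordPress cron for one site every N minutes.
# $1 = document root of the site (hypothetical path in the example below)
# $2 = interval in minutes (defaults to 5)
wp_cron_line() {
    docroot=$1
    minutes=${2:-5}
    printf '*/%s * * * * cd %s && php wp-cron.php >/dev/null 2>&1\n' "$minutes" "$docroot"
}

# Example with a made-up path; paste the output into "crontab -e" for that site's user:
wp_cron_line /var/www/vhosts/example.com/httpdocs 10
```

Staggering the interval between sites keeps all the cron runs from landing in the same minute.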

alvarezcruz · Aug 24, 2022

magestyx said:
it's about impossible for us to be watching at the right time

Install atop to monitor the system automatically. After the issue happens, you can check what was running and the build-up to it. See the article "How to monitor usage of system resources in a period of time using atop?"
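
A rough sketch of how a recorded spike could be replayed afterwards. It assumes atop writes its daily logs to /var/log/atop/atop_YYYYMMDD, which is the usual default but may differ on your install; the helper name atop_log and the date and times are examples only. On CentOS the sampling interval is typically set via LOGINTERVAL in /etc/sysconfig/atop (assumption: EPEL package layout), and the default of 10 minutes is far too coarse for short spikes.

```shell
# Build the default daily atop log path for a given date (YYYYMMDD).
# /var/log/atop is an assumed default location.
atop_log() {
    printf '/var/log/atop/atop_%s\n' "$1"
}

# Replay 14 Sep 2022 between 06:50 and 07:10 (example times):
#   atop -r "$(atop_log 20220914)" -b 06:50 -e 07:10
# Inside the replay, 't' steps forward one sample, 'T' steps back,
# and 'm' switches to the memory view (useful when kswapd0 is involved).
```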

Cike76 · Aug 27, 2022

magestyx:

If you implement cgroups, the server may stay up and you may be able to pinpoint the domain that causes the spike.
It's worth a try.

magestyx · Sep 14, 2022

maartenv said:
Did you enable the option "Take over wp-cron.php" in the WordPress Toolkit and run it on a different schedule?

WP Toolkit summary (docs.plesk.com)

That would prevent running the wp-cron.php every time someone visits a WordPress website.

We don't install CMSs via Plesk, but rather directly via FTP so that they're not tied into Plesk settings.
But good idea to check on the crons. We're trying to implement WP Crontrol on all WP sites and will be making that standard moving forward to mitigate this.
Thanks for your suggestion.
Magestyx

magestyx · Sep 14, 2022

Thanks everyone for the replies and suggestions. There's some in here we haven't tried yet.
We do have cgroups installed, but Plesk says there are a lot of limitations on what it can control, and apparently whatever the issue is must be outside its scope. I'm going to double check this one in particular, though.
Much appreciated!

Magestyx

magestyx · Sep 14, 2022

Regarding Cgroups - I checked just now and remember why it wasn't helping, at least from what I understand -
The minimum monitoring timeframe we can set in the Plesk dropdown for cgroups is 5m. Since these spikes always seem to ramp up to a CPU load of 30-60+ within 8-10 seconds and then calm down over about 3-4 minutes, I don't think the cgroups monitoring is ever able to 'catch' one of these incidents. At least, we've never gotten an alert or anything from it. The limits are indeed set for all subscriptions, though.
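
Since a 5-minute window is too coarse for something that ramps up in seconds, one option is a small background sampler that checks the load every second and writes a snapshot only when it crosses a threshold. This is a sketch, not a tested monitoring tool: the function names, the threshold of 8 and the log path are arbitrary examples, and on a badly stalled box the sampler itself may skip seconds.

```shell
# Return success when the load value in $1 is strictly above the threshold in $2.
load_exceeds() {
    awk -v l="$1" -v t="$2" 'BEGIN { exit !(l + 0 > t + 0) }'
}

# Sample once per second; when the 1-minute load crosses the threshold,
# append a timestamped snapshot of the busiest processes and memory state.
# $1 = load threshold, $2 = log file to append to.
watch_spikes() {
    threshold=$1
    logfile=$2
    while sleep 1; do
        read -r load1 _rest < /proc/loadavg
        if load_exceeds "$load1" "$threshold"; then
            {
                date '+%F %T'
                echo "load1=$load1"
                ps -eo pid,user:32,pcpu,pmem,stat,comm --sort=-pcpu | head -n 15
                grep -E '^(MemFree|MemAvailable|SwapFree|Dirty):' /proc/meminfo
            } >> "$logfile"
        fi
    done
}

# Example (run as root, in the background):
#   watch_spikes 8 /var/log/spike-watch.log &
```

The resulting log can then be lined up against the access logs for the same seconds.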

Magestyx

magestyx · Sep 14, 2022

While I was writing that last message, I happened to miss a spike on one server. It had already calmed back to normal by the time I was checking it, but out of curiosity I went through each and every subscription's CPU monitoring logs and not a single one registered anything out of the ordinary during that time.
So perhaps that's why cgroups isn't catching anything - maybe this isn't caused directly by the subscriptions or sites themselves.
But that's all that is on these systems - Plesk and sites, etc.
I don't know.

Cike76 · Sep 14, 2022

xmlrpc.php is sometimes used to generate attacks (internal and external), so it might be a good idea to block requests to that file.
Did you install Wordfence? This plugin blocks brute-force login attempts.
A good cache strategy saves CPU cycles that would otherwise be spent on WP. Have you set that up? For example, a site on one of our servers had a post that went viral, and the server didn't even flinch; all the traffic went to the cache...

magestyx · Sep 15, 2022

Hi,

Cike76 said:
xmlrpc.php is sometimes used to generate attacks (internal and external), so it might be a good idea to block requests to that file.
Did you install Wordfence? This plugin blocks brute-force login attempts.
A good cache strategy saves CPU cycles that would otherwise be spent on WP. Have you set that up? For example, a site on one of our servers had a post that went viral, and the server didn't even flinch; all the traffic went to the cache...

Hello.
Yes - we have blocks for all xmlrpc files in place, and we install Wordfence and very good caching on all WordPress sites.
After yesterday morning's spike we immediately looked through the monitoring stats of all hosting accounts and none of them showed a spike, so we're wondering whether this is not directly related to the sites themselves. We've tried running all the commands mentioned above, but nothing is jumping out as an issue.
Can anyone share what Plesk PHP-FPM settings they use for their hosting accounts?
We've tried every mix we can think of - lower settings, higher settings for the pm children, etc. as well as ondemand, static, etc.
We haven't found any good/clear references or guides or examples, just ethereal & vague recommendations. Of course it has to do with other server configs so there's no one-size-fits-all, but still SOMETHING to go on would be good. These new Plesk versions are the first we've had with the PHP-FPM configurations.
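
One common rule of thumb for a starting point (a heuristic, not an official Plesk or PHP recommendation) is to divide the memory you are willing to give PHP by the average resident size of a php-fpm child. With one pool per subscription, it is the sum of pm.max_children across all pools that has to fit into that budget. The sketch below does the arithmetic; the function name and the 4096 MiB budget are assumptions for illustration, and RSS overstates real usage because shared memory such as opcache is counted in every child.

```shell
# Estimate how many php-fpm children fit into a memory budget.
# stdin: one RSS value in KiB per line (e.g. from: ps -C php-fpm -o rss=)
# $1:    memory budget for all PHP children together, in MiB (your choice)
fpm_children_budget() {
    awk -v budget_mib="$1" '
        { sum += $1; n++ }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            avg_mib = sum / n / 1024
            printf "avg_rss_mib=%.0f max_children_total=%d\n", avg_mib, int(budget_mib / avg_mib)
        }'
}

# Live usage, assuming 4096 MiB may be spent on PHP in total:
#   ps -C php-fpm -o rss= | fpm_children_budget 4096
```

If the configured total is well above that estimate, a burst of requests can push the box into swap, which would at least be consistent with kswapd0 spiking.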

magestyx · Sep 15, 2022

Peter Debik said:
Load spikes can always be explained. You just need to find the right hook where to catch them.

Next time it occurs, run
# watch "ps aux | sort -nrk 3,3 | head -n 20"
as a first response to find out which processes are consuming the most CPU time.

I also recommend checking all PHP-FPM processes in such a situation. Most likely you'll find one user running many PHP-FPM children, each with a high load.
# ps aux | grep php-fpm
That user is your culprit. From there descend into the logs directory of the user's subscription and check what's going on in the error_log and access_ssl_log. You'll probably find the cause there, e.g. frequent requests from bad bots or something similar.

Hi, Peter.

So we've been trying to use these commands, but so far nothing jumps out as a definite issue.
We currently have pm.max_children set quite low for all accounts on the systems, so none of them seem to be spawning many additional children. Site speeds don't seem affected in our tests so far.
According to the 'ps aux...' results, the processes with the highest CPU times are: fail2ban-server, kswapd0, nydus-ex-api, grafana-server, rcu_sched, migration/0
The php-fpm processes for the hosting accounts themselves account for only a small fraction of that.
kswapd0 is the process that jumps to 100% every time one of these spikes happens, but on CentOS 7 the only setting we can find that affects it is swappiness, and we've tried numerous different values for it.
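
kswapd0 is the kernel thread that reclaims memory pages, so it pegging a core usually suggests the box is running short of free memory at that moment; swappiness only shifts how that reclaim is balanced between swap and page cache. A quick way to see how close a server is to that edge is to read /proc/meminfo. The helper below is a sketch with a made-up name, and it assumes the kernel reports MemAvailable (the CentOS 7 kernel does).

```shell
# Report available memory and free swap as whole percentages.
# stdin: text in /proc/meminfo format.
mem_pressure() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        $1 == "SwapTotal:"    { stotal = $2 }
        $1 == "SwapFree:"     { sfree = $2 }
        END {
            printf "mem_available_pct=%d", (total ? avail * 100 / total : 0)
            if (stotal) printf " swap_free_pct=%d", sfree * 100 / stotal
            printf "\n"
        }'
}

# Live usage:
#   mem_pressure < /proc/meminfo
# Current swappiness value:
#   sysctl vm.swappiness
# Swap-in / swap-out activity per second (si and so columns):
#   vmstat 1 5
```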
We've been suspicious before that it could be a problem with Plesk itself or an extension, but we've found no settings anywhere that specifically limit Plesk's resource usage. All searches point to things like cgroups regarding individual hosting accounts or Service Plans.
