Question Tracking down periodic high usage

Denis Gomes Franco · Mar 26, 2021

Hello everyone. I need some help identifying periodic, extremely high usage on one of my servers (I have two, and the same issue happens on the other as well).

From time to time CPU and memory usage goes through the roof:

That Grafana graph covers the past 24 hours; note that it happened three more times. The "valley" in the middle is me rebooting the server. That last spike was happening just now and, as mysteriously as it started, it has now ended.
This server has fairly large specs: 12 cores and 48 GB RAM, located at UpCloud. The other server is way smaller and is located at Linode. Both rarely go over 75% CPU usage at any point in time. Both host WordPress and WooCommerce sites developed by us. We run a managed hosting business, so we personally know each and every site owner, and no customer has access to the Plesk panel.
We have the New Relic agent installed on both servers but I'm fairly new to this tool and I am not so sure how to use it to debug things. Anyway, I have set up the agent so that New Relic will show statistics for each site independently, instead of aggregated inside one "PHP APPLICATION" group.
Outbound network traffic did not seem to go up beyond what's considered normal, so I don't think it's a DDoS, but I might be wrong.
All of the sites are low in traffic. Peak usage does not seem to correlate with any exceptional events (eg, a shop owner throwing a sale and bringing in lots of visitors).
All sites are running on PHP-FPM and NGINX with MariaDB 10. Apache is completely disabled for all sites. I would even uninstall Apache but Plesk won't let me.

Apache CPU usage, Apache & PHP-FPM memory usage and MySQL CPU usage all go fairly wild, while MySQL memory usage stays more or less the same.
Neither the Process List nor the MySQL Process List seems to yield any useful information, and neither does htop.
All plugins used in all sites are always kept up to date, including Elementor which had a vulnerability fixed in the last few days.

I'm aware that this case will require some digging, so I don't expect a solution right away, but if someone can point me in the right direction or provide any useful info, I would be very grateful.

Denis Gomes Franco · Mar 26, 2021

Some more info:
We have Redis installed and WordPress set up to use object caching.
Nearly all of the sites use WP Super Cache.
Sites don't go fully offline during the spikes, but they do get very, very slow.
While spiking, htop shows all cores at nearly 100%. The process list does not show a single process taking up most of the CPU, i.e., I can't tell whether a specific website is using up all the resources.
We considered using CloudLinux, but since we do managed hosting and maintenance (instead of just reselling hosting to customers), I failed to see any benefit.
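
When htop shows every core busy but no single process stands out, summing CPU per system user may help narrow things down, since Plesk typically runs each subscription's PHP-FPM pool under its own system user. Below is a minimal sketch, assuming input in the shape produced by `ps -eo user,pcpu,comm --no-headers`. The function name `cpu_by_user`, the user names and the sample figures are made up for illustration.

```python
from collections import defaultdict


def cpu_by_user(ps_output):
    """Sum the %CPU column per user from `ps -eo user,pcpu,comm` output."""
    totals = defaultdict(float)
    for line in ps_output.splitlines():
        parts = line.split(None, 2)  # user, %cpu, command
        if len(parts) < 2:
            continue
        try:
            pcpu = float(parts[1])
        except ValueError:
            continue  # header row or malformed line
        totals[parts[0]] += pcpu
    # Heaviest users first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up sample; the user names are hypothetical.
sample = """\
site_a   50.0 php-fpm
site_b   10.0 php-fpm
site_a   25.5 php-fpm
mysql    30.0 mysqld
"""

for user, pcpu in cpu_by_user(sample):
    print(f"{user:<10} {pcpu:6.1f}")
```

Note that `ps` reports %CPU averaged over each process's lifetime, so this is a rough snapshot rather than a live reading. Running it a few times during a spike and comparing the top users should still be informative.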

Denis Gomes Franco · Mar 26, 2021

Just read about this: Recently Patched Vulnerability in Thrive Themes Actively Exploited in the Wild

We don't use Thrive Themes and the spikes in load started before this, but I thought about posting this here.

weltonw · Mar 26, 2021

New Relic is your friend here. Are you tracking Apache stats?

If so, run

Code:

SELECT average(`net.requestsPerSecond`), average(`server.busyWorkers`) FROM ApacheSample SINCE 48 Hours AGO TIMESERIES

to start. Also see if:

Code:

SELECT max(cpuPercent) FROM ProcessSample FACET commandLine SINCE 48 Hours AGO TIMESERIES

and

Code:

SELECT max(memoryResidentSizeBytes) FROM ProcessSample FACET commandLine SINCE 48 Hours AGO TIMESERIES

turn up anything interesting.

weltonw · Mar 26, 2021

Denis Gomes Franco said:
Just read about this: Recently Patched Vulnerability in Thrive Themes Actively Exploited in the Wild

We don't use Thrive Themes and the spikes in load started before this, but I thought about posting this here.

That's completely unrelated.

I'd also check your PHP-FPM and NGINX logs.

weltonw · Mar 26, 2021

Sorry, you're running NGINX.

Try:

Code:

SELECT average(`net.requestsPerSecond`), average(`net.connectionsActive`) FROM NginxSample SINCE 48 Hours AGO TIMESERIES

Denis Gomes Franco · Mar 26, 2021

Hey John, thanks for the reply. That query yielded no results, and I don't know how to check the number of connections via New Relic. However, I am checking the dashboard right now, looking at a past incident, and the number of network packets doesn't seem to have fluctuated.

weltonw · Mar 26, 2021

What about the ProcessSample queries?

Denis Gomes Franco · Mar 26, 2021

New Relic is really fairly new to me, LOL. I'm not familiar with these queries yet.

Anyway, the problem started again a few minutes ago at 19:41 (local time), I'm currently watching to find clues.

weltonw · Mar 26, 2021

Check FPM logs. /var/log/plesk-phpversion-fpm/

The queries can be run by going to Query Your Data up in the corner.

Denis Gomes Franco · Mar 26, 2021

There are only error logs in this directory. It's filling quite rapidly with warnings. Seems like some plugins could use a little update but we already keep them up to date anyway. Not sure though if this is related to the high usage.

weltonw · Mar 26, 2021

Denis Gomes Franco said:
There are only error logs in this directory. It's filling quite rapidly with warnings. Seems like some plugins could use a little update but we already keep them up to date anyway. Not sure though if this is related to the high usage.

Well, it can't hurt to post them. What processes are running? (ps / htop / top).

Denis Gomes Franco · Mar 26, 2021

Okay John, thanks to your direction I think I might have found the culprit. The log files are showing lots of warnings but one site seemed to stand out from the others. I temporarily suspended it and CPU usage dropped. I'll investigate further and see if this is indeed the case.
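
For anyone trying the same thing: a quick way to see which site "stands out" is to count log lines per pool. This is only a rough sketch, assuming the PHP-FPM error log tags each line with `[pool NAME]` (the way stock PHP-FPM does for child output) and that the pool name maps to a site. The sample lines, domain names and the function name `count_by_pool` are made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape (stock PHP-FPM master log), e.g.:
# [26-Mar-2021 19:41:03] WARNING: [pool example.com] child 123 said into stderr: "..."
POOL_RE = re.compile(r"\[pool ([^\]]+)\]")


def count_by_pool(lines):
    """Count log lines per PHP-FPM pool; lines without a pool tag are ignored."""
    counts = Counter()
    for line in lines:
        match = POOL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common()  # noisiest pool first


# Hypothetical sample; in practice, pass an open log file instead of this list.
sample = [
    '[26-Mar-2021 19:41:03] WARNING: [pool shop.example] child 101 said into stderr: "PHP Warning: ..."',
    '[26-Mar-2021 19:41:04] WARNING: [pool shop.example] child 102 said into stderr: "PHP Warning: ..."',
    '[26-Mar-2021 19:41:04] WARNING: [pool blog.example] child 201 said into stderr: "PHP Notice: ..."',
    '[26-Mar-2021 19:41:05] WARNING: [pool shop.example] child 101 said into stderr: "PHP Warning: ..."',
    '[26-Mar-2021 19:41:06] NOTICE: fpm is running, pid 1',
]

for pool, n in count_by_pool(sample):
    print(pool, n)
```

A high line count does not prove a site is the one eating CPU, so suspending the top candidate and watching the load is still the real test.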

Denis Gomes Franco · Mar 26, 2021

Alright, ten minutes in and things have improved vastly. Looks like the problem was being caused by an outdated Product Add-Ons plugin (WooCommerce Product Add-Ons - Custom & Personalized Products) on this site. Thing is, the WordPress Toolkit *did not* mark this plugin as needing an update...
