
Question Tracking down periodic high usage

Denis Gomes Franco

Regular Pleskian
Hello everyone. I need some help identifying periodic, extremely high usage on one of my servers (I have two, and the same issue happens on the other as well).

From time to time, CPU and memory usage go through the roof:
1616763084734.png
That Grafana graph covers the past 24 hours; note that it happened three more times. The "valley" in the middle is me rebooting the server. The last spike was happening as I wrote this and, as mysteriously as it started, it has now ended.
This server has fairly large specs: 12 cores and 48 GB RAM, hosted at UpCloud. The other server is much smaller and is hosted at Linode. Both rarely go over 75% CPU usage at any point in time. Both host WordPress and WooCommerce sites developed by us. We run a managed hosting business, so we personally know each and every site owner, and no customer has access to the Plesk panel.
We have the New Relic agent installed on both servers but I'm fairly new to this tool and I am not so sure how to use it to debug things. Anyway, I have set up the agent so that New Relic will show statistics for each site independently, instead of aggregated inside one "PHP APPLICATION" group.
Outbound network traffic did not seem to go up beyond what's considered normal, so I don't think it's a DDoS, but I might be wrong.
All of the sites are low in traffic. Peak usage does not seem to correlate with any exceptional events (e.g., a shop owner running a sale and bringing in lots of visitors).
All sites are running on PHP-FPM and NGINX with MariaDB 10. Apache is completely disabled for all sites. I would even uninstall Apache, but Plesk won't let me.
1616764360720.png
Apache CPU usage, Apache & PHP-FPM memory usage and MySQL CPU usage all go fairly wild, while MySQL memory usage stays more or less the same.
Neither the Process List nor the MySQL Process List yields any useful information, and neither does htop.
All plugins used in all sites are always kept up to date, including Elementor which had a vulnerability fixed in the last few days.

I'm aware that this case will require some digging, so I don't expect a solution right away, but if someone can point me in the right direction or provide any useful info, I would be very grateful.
 
Some more info:
We have Redis installed and WordPress set up to use object caching
Nearly all of the sites use WP Super Cache
Sites don't go fully offline during spikes, but they do get very, very slow
While spiking, htop shows all cores at nearly 100%. The process list does not show a single process taking up most of the CPU, i.e., I can't tell whether a specific website is using up all the resources.
We considered using CloudLinux, but since we do managed hosting and maintenance (instead of just reselling hosting to customers), I failed to see any benefits.
 
New Relic is your friend here. Are you tracking Apache stats?

If so, run

Code:
SELECT average(`net.requestsPerSecond`), average(`server.busyWorkers`) FROM ApacheSample SINCE 48 Hours AGO TIMESERIES

to start. Also see if:

Code:
SELECT max(cpuPercent) FROM ProcessSample FACET commandLine SINCE 48 Hours AGO TIMESERIES
and
Code:
SELECT max(memoryResidentSizeBytes) FROM ProcessSample FACET commandLine SINCE 48 Hours AGO TIMESERIES

turn up anything interesting.
 
Sorry, you're running NGINX.

Try:

Code:
SELECT average(`net.requestsPerSecond`), average(`net.connectionsActive`) FROM NginxSample SINCE 48 Hours AGO TIMESERIES
 
Hey John, thanks for the reply. That query yielded no results, and I don't know how to check the number of connections via New Relic. However, I am looking at a past incident on the dashboard right now, and the number of network packets doesn't seem to have fluctuated.
1616798891809.png
 
New Relic is really fairly new to me LOL, so I'm not familiar with these queries yet.

Anyway, the problem started again a few minutes ago at 19:41 (local time), I'm currently watching to find clues.
 
Check the FPM logs in /var/log/plesk-phpversion-fpm/

The queries can be run by going to Query Your Data up in the corner.
 
There are only error logs in this directory. It's filling quite rapidly with warnings. Seems like some plugins could use a little update but we already keep them up to date anyway. Not sure though if this is related to the high usage.
 
Well, it can't hurt to post them. What processes are running? (ps / htop / top).
 
Okay John, thanks to your direction I think I might have found the culprit. The log files are showing lots of warnings but one site seemed to stand out from the others. I temporarily suspended it and CPU usage dropped. I'll investigate further and see if this is indeed the case.
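
For anyone repeating this check, spotting the site that stands out in the FPM error logs can be sketched as a per-pool warning count. This is only an illustration under assumptions not confirmed in the thread: that the log lines follow the usual PHP-FPM shape with a `[pool name]` tag, and that the pool name identifies the site. The function name and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumed PHP-FPM log shape: "[date] WARNING: [pool <name>] child ... said into stderr: ..."
POOL_RE = re.compile(r"\[pool ([^\]]+)\]")


def warnings_per_pool(lines):
    """Count WARNING lines per FPM pool, most frequent pool first."""
    counts = Counter()
    for line in lines:
        if "WARNING" not in line:
            continue
        match = POOL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common()


# Hypothetical log lines, for illustration only.
sample = [
    '[26-Mar-2021 19:41:02] WARNING: [pool shop.example] child 101 said into stderr: "PHP Warning: ..."',
    '[26-Mar-2021 19:41:03] WARNING: [pool blog.example] child 202 said into stderr: "PHP Warning: ..."',
    '[26-Mar-2021 19:41:04] WARNING: [pool shop.example] child 101 said into stderr: "PHP Warning: ..."',
    '[26-Mar-2021 19:41:05] NOTICE: [pool blog.example] child 203 started',
]
for pool, count in warnings_per_pool(sample):
    print(f"{count:6d}  {pool}")
```

To use it on a real log, pass an open file handle for the error log in the directory mentioned above; iterating the handle yields one line at a time, so large logs do not need to fit in memory. A high warning count alone does not prove a site is the cause, so temporarily suspending the suspect and watching CPU, as done here, is still the confirming step.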
 