Issue Clarification needed: Inconsistency between Resource Limits (CPU 100%), Monitoring, and Process List (Server Crashes)

grafiman (New Pleskian)
Server operating system version: AlmaLinux 9.7
Plesk version and microupdate number: 18.0.76

Hi everyone,

I am experiencing a critical issue regarding resource consumption on my server, and I’m having trouble understanding the discrepancies between Plesk's different monitoring tools and the actual resource limits.

My Server & Setup:

  • Total Cores: 24 Cores
  • Subscription Setup: On a specific website, I have set the limits in "RAM, CPU, Disk I/O" as follows:
    • CPU Limit: 100%
    • PHP-FPM: pm.max_children set to 10

The Issue: Despite these strict limits, when this specific site gets hit by aggressive crawlers (like FB-BLOCK or similar bots), the total CPU usage in the system-wide Process List spikes to 80-90%, ultimately causing the entire 24-core server to crash or become totally unresponsive.

My Questions:

  1. CPU Limit Definition: It is my understanding that a CPU Limit of 100% restricts the subscription to the equivalent of exactly 1 CPU core. If my server has 24 cores, how is it mathematically possible for this single subscription to consume 80-90% of the total system CPU and crash the server?
  2. Dashboard Inconsistencies: There seems to be a massive disconnect between what is displayed in Subscription Monitoring, the limits enforced by cgroups (RAM, CPU, Disk I/O), and the actual real-time Process List. Why aren't these limits strictly containing the spike, and why do the indicators show different realities?
Are the limits failing to apply to child processes, or is database (MySQL/MariaDB) CPU usage completely bypassing these subscription limits?

I would really appreciate it if someone could explain the architecture behind how these three components (Limits, Monitoring, Process List) interact, and how I can strictly isolate this site so it never takes down the other 23 cores.

Thank you in advance!
 
Hi, @grafiman.
CPU Limit Definition: It is my understanding that a CPU Limit of 100% restricts the subscription to the equivalent of exactly 1 CPU core. If my server has 24 cores, how is it mathematically possible for this single subscription to consume 80-90% of the total system CPU and crash the server?

Your understanding is correct. One possible cause of the issue is that Cgroups Manager does not limit the resources consumed by Plesk extensions or by custom, manually installed services. Could you please share more details about the processes running for the domain in question during a CPU spike? You can use htop or top -c to check.
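
To put a number on that, here is a rough sketch of the arithmetic, assuming the limit is expressed per core (100% = one full core) as described above. The function name is made up for illustration and is not part of Plesk.

```python
def limit_as_share_of_server(limit_percent: float, total_cores: int) -> float:
    """Convert a per-core CPU limit (100% = one full core) into a
    percentage of the whole server's CPU capacity."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    return limit_percent / total_cores


# Values from this thread: a 100% limit on a 24-core server.
share = limit_as_share_of_server(limit_percent=100, total_cores=24)
print(f"A 100% limit on 24 cores is about {share:.2f}% of total capacity")
```

So if the limit were the only thing in play, this subscription alone could account for roughly 4% of total CPU. That suggests the 80-90% is coming from processes outside the subscription's cgroup.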

Dashboard Inconsistencies: There seems to be a massive disconnect between what is displayed in Subscription Monitoring, the limits enforced by cgroups (RAM, CPU, Disk I/O), and the actual real-time Process List. Why aren't these limits strictly containing the spike, and why do the indicators show different realities?
If I understand the question correctly, this happens because Cgroups Manager calculates the average consumption over each 5-minute interval rather than in real time. You can find more details here.
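
As a rough illustration of why an averaged figure and a live process list can disagree, here is a small sketch. The sample values are invented, and it only models the averaging itself, not how Cgroups Manager actually collects data.

```python
from statistics import mean

WINDOW_SECONDS = 300  # the 5-minute interval mentioned above

# Hypothetical per-second CPU readings (in percent of one core):
# a quiet baseline with a 30-second burst in the middle.
samples = [20.0] * WINDOW_SECONDS
for second in range(120, 150):
    samples[second] = 1800.0  # burst equivalent to 18 busy cores

average = mean(samples)
peak = max(samples)

print(f"5-minute average: {average:.0f}%")
print(f"peak within the window: {peak:.0f}%")
```

With these made-up numbers the averaged view reports about 198% while the live view would have shown 1800% during the burst, so the two tools can both be right and still look inconsistent.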

Are the limits failing to apply to child processes, or is database (MySQL/MariaDB) CPU usage completely bypassing these subscription limits?

MySQL/MariaDB runs as a single server-wide process, so the monitoring tools have no way to attribute its usage to a specific subscription.
 
Using Cgroups isn't as helpful these days. It was a solution to a problem from years past, but today's extremely aggressive bots can easily overwhelm a server in minutes or less, faster than the Cgroups detection interval.

You noted that pm.max_children is set to 10 (I assume this is set in a plan and applied to an unlocked subscription, or set manually at the subscription level), but what do you have pm.max_requests set to? You may want workers to be recycled, so try lowering it to something like 200.

Since those bots are retrying faster than PHP-FPM can even recycle workers, it might also be a good idea to implement Nginx throttling by IP to help avoid CPU thrashing and overload: limit the concurrent connections per IP, allow small bursts, and return 429 when the limits are hit. That way you rate-limit requests at the socket level, before they even reach PHP.
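
The "small bursts allowed, then 429" idea can be sketched in a few lines. This is plain Python rather than Nginx configuration, and it uses a simple token bucket only to show the behaviour; Nginx's own limiting modules differ in detail. The class name, rate, and burst values are all made up for illustration.

```python
class PerIpLimiter:
    """Toy per-IP token bucket: each IP may burst up to `burst` requests,
    then is refilled at `rate` requests per second."""

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate
        self.burst = burst
        self._state: dict[str, tuple[float, float]] = {}  # ip -> (tokens, last_seen)

    def check(self, ip: str, now: float) -> int:
        """Return the HTTP status this request would get: 200 or 429."""
        tokens, last_seen = self._state.get(ip, (float(self.burst), now))
        # Refill according to elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last_seen) * self.rate)
        if tokens >= 1:
            self._state[ip] = (tokens - 1, now)
            return 200
        self._state[ip] = (tokens, now)
        return 429


limiter = PerIpLimiter(rate=2, burst=5)  # hypothetical values
# Eight requests from one IP in the same instant: the burst passes, the rest do not.
statuses = [limiter.check("203.0.113.7", now=0.0) for _ in range(8)]
# One second later the bucket has refilled enough for another request.
after = limiter.check("203.0.113.7", now=1.0)
print(statuses, after)
```

The point is that the rejected requests cost almost nothing compared with a PHP worker, which is why throttling in front of PHP-FPM helps where pm.max_children alone does not.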
 
CPU Limit Definition: It is my understanding that a CPU Limit of 100% restricts the subscription to the equivalent of exactly 1 CPU core. If my server has 24 cores, how is it mathematically possible for this single subscription to consume 80-90% of the total system CPU and crash the server?

Hi,

Keep in mind that I've only managed VPSs with 4 to 8 cores, but in my very limited experience I've sometimes found that setting the CPU limit too low for certain sites has the opposite of the desired effect (I rarely go below 150% on an 8-core machine these days because of that).
 
The pm.max_requests setting is set to 0 by default for some accounts and to 1000 for others, but problems occur in both cases.
 
Thank you for your answer. I didn't use `htop` or `top -c` at the time. I looked at the Process List, and it showed something like this:
Name/Command = php-fpm
Subscription = example.com
CPU Usage = 2.5-3% (there were 10 of these)

The other entries were normal (around 0.2%), and mariadbd was around 0.8%. At the same time, the Process List showed a Total CPU usage of 80% (75% by domains). From this I concluded that the situation was caused solely by bot traffic (mostly Meta bots).
 