
Issue 100% CPU

tnickoloff

New Pleskian
Server operating system version
Ubuntu 22.04
Plesk version and microupdate number
18.0.58 Update #2
Hello, my VPS is hitting 100% CPU usage caused by mariadb and sw-engine. I tried this solution: but when I start sw-engine, CPU usage goes back to 100%. I tried this a couple of times, and every time I stopped sw-engine, the same processes were stuck.
Later, after a couple of hours (maybe 3 or more), the problem went away and CPU usage returned to normal. The next day it happened again.
This was last week. I couldn't find the cause, so yesterday I reinstalled the OS and Plesk (image provided by the datacenter) and redeployed the site. Everything was fine until 2 hours ago, when CPU usage hit 100% again.
What can I do? Please help. I can provide logs or whatever else you need.
[Attachment: plsk.png]
 

In that case this is caused either by hanging sw-engine processes or sub-processes, or by a large number of incoming requests against the login page. Have you activated Fail2Ban to stop brute-force attacks against port 8443? Could you also check /var/log/sw-cp-server/error_log for additional information?
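For example, a quick check could look like this (the jail name "plesk-panel" is only the usual default and may differ on your installation, so list the active jails first):

Bash:
# List the active Fail2Ban jails
fail2ban-client status

# Show the status and current bans of the Plesk panel jail
# (replace "plesk-panel" with the jail name from the list above)
fail2ban-client status plesk-panel

# Inspect the most recent panel web server errors
tail -n 100 /var/log/sw-cp-server/error_log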
 
Fail2Ban is activated. Here is part of /var/log/sw-cp-server/error_log:

2024/02/21 12:33:21 [crit] 85549#0: accept4() failed (24: Too many open files)
2024/02/21 12:33:35 [crit] 85549#0: accept4() failed (24: Too many open files)
2024/02/21 12:34:05 [crit] 85549#0: accept4() failed (24: Too many open files)
2024/02/21 12:34:06 [crit] 85549#0: accept4() failed (24: Too many open files)
2024/02/21 12:35:21 [crit] 85549#0: accept4() failed (24: Too many open files)
2024/02/21 12:35:38 [crit] 85549#0: accept4() failed (24: Too many open files)
2024/02/21 12:38:10 [crit] 85549#0: accept4() failed (24: Too many open files)
2024/02/21 12:42:44 [crit] 85549#0: accept4() failed (24: Too many open files)
2024/02/21 13:07:14 [error] 361155#0: *7079581 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 127.0.0.1, server: , request: "GET / HTTP/1.0", upstream: "fastcgi://unix:/var/run/sw-engine.sock:", host: "MY_IP"
2024/02/21 13:07:14 [error] 361155#0: *7079593 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 127.0.0.1, server: , request: "GET / HTTP/1.0", upstream: "fastcgi://unix:/var/run/sw-engine.sock:", host: "MY_IP"
2024/02/21 13:07:14 [error] 361155#0: *7079605 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 127.0.0.1, server: , request: "GET / HTTP/1.0", upstream: "fastcgi://unix:/var/run/sw-engine.sock:", host: "MY_IP"
2024/02/21 13:07:14 [error] 361155#0: *7079561 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 127.0.0.1, server: , request: "GET / HTTP/1.0", upstream: "fastcgi://unix:/var/run/sw-engine.sock:", host: "MY_IP"
2024/02/21 13:07:14 [error] 361155#0: *7079595 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 127.0.0.1, server: , request: "GET / HTTP/1.0", upstream: "fastcgi://unix:/var/run/sw-engine.sock:", host: "MY_IP"
2024/02/21 13:07:14 [error] 361155#0: *7079579 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 127.0.0.1, server: , request: "GET / HTTP/1.0", upstream: "fastcgi://unix:/var/run/sw-engine.sock:", host: "MY_IP"
2024/02/21 13:07:14 [error] 361155#0: *7079596 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 127.0.0.1, server: , request: "GET / HTTP/1.0", upstream: "fastcgi://unix:/var/run/sw-engine.sock:", host: "MY_IP"
2024/02/21 13:07:14 [error] 361155#0: *7079620 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 127.0.0.1, server: , request: "GET / HTTP/1.0", upstream: "fastcgi://unix:/var/run/sw-engine.sock:", host: "MY_IP"
2024/02/21 13:07:14 [error] 361155#0: *7079594 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 127.0.0.1, server: , request: "GET / HTTP/1.0", upstream: "fastcgi://unix:/var/run/sw-engine.sock:", host: "MY_IP"
2024/02/21 13:07:14 [error] 361155#0: *7079590 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 127.0.0.1, server: , request: "GET / HTTP/1.0", upstream: "fastcgi://unix:/var/run/sw-engine.sock:", host: "MY_IP"
2024/02/21 13:07:14 [error] 361155#0: *7079630 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 127.0.0.1, server: , request: "GET / HTTP/1.0", upstream: "fastcgi://unix:/var/run/sw-engine.sock:", host: "MY_IP"
2024/02/21 13:07:14 [error] 361155#0: *7079611 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 127.0.0.1, server: , request: "GET / HTTP/1.0", upstream: "fastcgi://unix:/var/run/sw-engine.sock:", host: "MY_IP"
2024/02/21 13:07:14 [error] 361155#0: *7079612 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 127.0.0.1, server: , request: "GET / HTTP/1.0", upstream: "fastcgi://unix:/var/run/sw-engine.sock:", host: "MY_IP"
2024/02/21 13:07:14 [error] 361155#0: *7079624 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 127.0.0.1, server: , request: "GET / HTTP/1.0", upstream: "fastcgi://unix:/var/run/sw-engine.sock:", host: "MY_IP"
2024/02/21 13:07:14 [error] 361155#0: *7079623 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 127.0.0.1, server: , request: "GET / HTTP/1.0", upstream: "fastcgi://unix:/var/run/sw-engine.sock:", host: "MY_IP"
 
"Too many open files" look suspicious. Make sure to apply at least the advice given in https://support.plesk.com/hc/en-us/...-reload-on-a-Plesk-server-Too-many-open-files, but also check whether you need to add "fs.file-max = <put high number here>" into /etc/sysctl.conf and add hard and soft limits to /etc/security/limits.conf, too, for example:

Code:
nginx soft nofile <high number here>
nginx hard nofile <high number here>
root soft nofile <high number here>
root hard nofile <high number here>
psaadm soft nofile <high number here>
psaadm hard nofile <high number here>
mysql soft nofile <high number here>
mysql hard nofile <high number here>
httpd soft nofile <high number here>
httpd hard nofile <high number here>

with "<high number here>" a fairly high number of a number of open files allowed on your system. For example: 100000. The actual number of open files on your system can be determined by running lsof | wc -l. Doe not exceed 1 Mio (1000000) in your configuration files, because on some OS, a larger number could break SSH root access (su/sudo).
After making the changes to the files, run systemctl --system daemon-reload && sysctl -p, then restart the services, e.g. service sw-engine restart && service sw-cp-server.
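A minimal sketch of the whole sequence with example values (the numbers and the pgrep-based check are assumptions; adapt them to your own lsof | wc -l baseline and process names):

Bash:
# Kernel-wide limit (example value; keep it well below 1000000)
echo "fs.file-max = 100000" >> /etc/sysctl.conf
sysctl -p

# After adding the nofile lines to /etc/security/limits.conf:
systemctl --system daemon-reload
service sw-engine restart
service sw-cp-server restart

# Verify what a restarted service actually received
# (process name is an assumption; adjust if needed)
grep 'Max open files' /proc/$(pgrep -o sw-engine)/limits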
 
Peter, thank you very much for your reply!
I did everything you wrote. I set 'Max open files' to 4096. Surprisingly, the check before applying the new setting showed two quite different values:

Bash:
grep 'Max open files' /proc/$(cat /var/run/nginx.pid)/limits
Max open files            1024                 524288               files

I don't know whether that is normal or not.
The actual number of open files at the moment:
Bash:
lsof | wc -l
27637
But right now there is no domain pointing to this server (after the failure yesterday I pointed the domain to another server), so CPU usage is at 0% now.

I put the values you suggested into the /etc/sysctl.conf and /etc/security/limits.conf files and restarted the services.

This VPS serves only 1 site, 1 domain.

I appreciate your help. My question is: is there anything I should do before switching the server back into production? I really want to avoid trial and error.
 
Based on the information provided earlier, I am certain that the issue is caused by the "too many open files" situation. But who knows whether other problems exist. You might also consider setting
fs.inotify.max_user_watches = 560144
fs.inotify.max_user_instances = 1024
(or other suitably high values for your server) in /etc/sysctl.conf.
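For example, a small sketch of checking and applying those values (the numbers are just the ones above; tune them to your server):

Bash:
# Current values
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances

# Append the suggested (example) values and apply them
echo "fs.inotify.max_user_watches = 560144" >> /etc/sysctl.conf
echo "fs.inotify.max_user_instances = 1024" >> /etc/sysctl.conf
sysctl -p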
For your webserver(s) you could consider running
# /usr/local/psa/admin/sbin/websrv_ulimits --set 500000 --no-restart
or another high number that fits your situation, so that they also have a high "max files" limit upon each start attempt.
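Since the command above uses --no-restart, a hedged follow-up (service names are assumptions; use whichever web servers run on your system) is to restart the web servers and confirm the limit they start with, similar to the nginx check you already ran:

Bash:
# Restart the web servers so the new ulimit takes effect
service apache2 restart && service nginx restart

# Confirm the limit of the running nginx master process
grep 'Max open files' /proc/$(cat /var/run/nginx.pid)/limits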
 