Issue init at 100% CPU and pleskrc very slow to reload processes

websavers · Jul 2, 2018

You may experience this issue in a few different ways:
- The `init` process (or systemd) sits at 100% CPU for lengthy periods of time
- `pleskrc` takes a very long time to run: even a simple command like `/usr/local/psa/admin/sbin/pleskrc httpd status` takes a good 7-10 seconds rather than being instant
- Logging in via SSH can take a good 30 seconds or longer

This is an indirect result of pleskrc's call to `/bin/systemctl list-unit-files` on line 304. Apparently this is a systemd bug described here (actually a bug in dbus) wherein leaked scope units cause `list-unit-files` to take a very long time to compile and display its list. The workaround is to run the following commands to clean up the session scope units:

Code:

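# Remove leftover transient unit files for session scopes
find /run/systemd/system -name "session-*.scope" -delete
# Remove anything else matching the pattern (e.g. session-*.scope.d directories)
rm -rf /run/systemd/system/session*scope*
# Stop abandoned scopes; -r (GNU xargs) skips the call when nothing matches
systemctl | grep "abandoned" | grep -e "-[[:digit:]]" | sed "s/\.scope.*/.scope/" | xargs -r systemctl stop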

The bug on the systemd GitHub page is marked closed because, although the symptoms show up in systemd, the real issue appears to be in dbus. That was fixed in April 2017 with dbus version 1.11.10; however, CentOS 7.5 uses 1.10.24 and presumably has not cherry-picked this fix into its 1.10.24 build of dbus.
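As a rough illustration of the version check described above, here is a minimal sketch that compares an installed upstream dbus version against 1.11.10. The helper name `version_ge` is made up for this example, and it relies on GNU `sort -V`. Keep in mind that a distribution can backport a fix without bumping the upstream version, so a "too old" result here is only a hint, not proof that the fix is missing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when version $1 >= version $2.
# Uses GNU sort's -V (version sort): if the smaller of the two is $2,
# then $1 is at least as new as $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}
```

On an RPM-based system this could be used as `version_ge "$(rpm -q --qf '%{VERSION}' dbus)" 1.11.10 || echo "upstream version predates the fix"`.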

I wonder if, to work around this, Plesk should use the commands above to clear out session scope files before running `list-unit-files`, or avoid using `list-unit-files` until this is repaired with an update to dbus in CentOS?

futureweb · Jul 16, 2018

@IgorG - do you have any Feedback to this?
Same Problem (init @ 100% CPU) already affected multiple Plesk Onyx Servers of us - some Servers also needed a reboot to get back to normal ...
thx, bye from Austria
Andreas Schnederle-Wagner

websavers · Jul 16, 2018

The conversation here on the Red Hat bug tracker indicates that they backported the systemd workaround, but doesn't make it clear whether they've backported the dbus core fix yet.

Based on the session descriptors that appear in my systemd when this issue occurs, my best guess is that this is occurring when there are many repeated connections (possibly proftpd connections) that don't get cleared properly.

Note that you *can* run those commands above to solve the issue without a reboot. If there is a process that triggered the problem (basically any pleskrc command) then even after running those commands you'll need to either kill that process or wait it out, then init will go back to normal until those session descriptors build up again. It seems to happen on our busier CentOS 7 servers roughly every 1-2 days.
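To see whether the session scopes have built up before running the cleanup commands, something like the following sketch could help. It assumes the usual `systemctl` column layout (UNIT, LOAD, ACTIVE, SUB, DESCRIPTION); the function name is made up for this example. Unlike the `grep -e "-[[:digit:]]"` filter in the original commands, it matches any `session-*.scope` unit whose SUB state is `abandoned`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `systemctl` unit listing on stdin and prints
# the names of session scopes whose SUB state (4th column) is "abandoned".
abandoned_session_scopes() {
    awk '$4 == "abandoned" && $1 ~ /^session-.*\.scope$/ { print $1 }'
}
```

For example, `systemctl list-units --type=scope --no-legend | abandoned_session_scopes | wc -l` prints the current count of abandoned session scopes.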

futureweb · Sep 7, 2018

The problem is that I can't babysit a few dozen servers to issue those commands whenever needed ...

It just happened again: the Apache web server was down for a few minutes during an httpd reload after a customer changed something in Plesk ... resulting in 300 customer websites being down ...

@IgorG - any feedback from officials on this?

thx
Andreas

IgorG · Sep 9, 2018

Yes, it is a known issue. It was already reported here: "Forwarded to devs - CGroups / systemd scope units slowing down".
The corresponding bug report PPPM-7414 was submitted and the KB article "Systemd causes high CPU load" was published.

futureweb · Sep 25, 2018

Hey @IgorG,
unfortunately the workaround listed in the KB isn't really working ... I tried it on 2 servers:

CT-404 /# systemctl | grep abandoned | grep -e "[[:digit:]]" | sed "s/.scope.*/.scope/" | xargs systemctl stop
Failed to list units: Connection timed out
Too few arguments.

Andreas

futureweb · Sep 25, 2018

After a server restart it's working ... some servers had over 50,000 abandoned scopes ...
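With tens of thousands of scopes, it may be gentler to stop them in batches and to skip the `systemctl stop` call entirely when the list is empty (which is what produces "Too few arguments."). The sketch below is only an assumption about how that could look: `SYSTEMCTL` and `BATCH_SIZE` are made-up knobs for this example, and `-r` is the GNU xargs option for "do not run on empty input". It does not help when systemd itself times out while listing units, as in the output above.

```shell
#!/usr/bin/env bash
# Hypothetical knobs for this sketch; override them from the environment.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"
BATCH_SIZE="${BATCH_SIZE:-50}"

# Reads unit names on stdin and stops them BATCH_SIZE at a time.
# xargs -r: run nothing at all if stdin is empty.
stop_scopes() {
    xargs -r -n "$BATCH_SIZE" "$SYSTEMCTL" stop
}
```

The input would be the list of abandoned scope names, one per line, for example the output of the `sed` stage in the workaround pipeline.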

When will PPPM-7414 be fixed? Doesn't seem to be a minor thing which can be "ignored" half a year or longer ...

IgorG · Sep 25, 2018

futureweb said:
When will PPPM-7414 be fixed?

As far as I know, it should be fixed in 17.9

websavers · May 4, 2019

The KB article Igor linked to here has an updated script that should work without affecting system services in any way, now that it only runs on sessions that haven't recently spawned processes. Note our changes in the comments below to ensure it *only* kills abandoned sessions.

We've done a bit more testing on this recently and confirmed that while there's always a few abandoned sessions from the root user, the majority of them come from a system user account (shell) that has a site actively using php-fpm. When we had a particular offender's site set to PM mode static with 5 processes, we encountered *many* abandoned sessions. Setting that same site to php-fpm ondemand mode results in far fewer abandoned sessions, yet still more than other sites which don't experience any abandoned sessions. Perhaps it's when a php-fpm process expires due to timeouts? Further diagnostics will be required.

Update: this *may* not be php-fpm specific. It seems to be occurring on sites which are running old PHP code (like old versions of WordPress) and are experiencing lots of PHP errors in the logs and/or mod sec errors. We *think* the one causing problems on one of our servers is using FastCGI mode, hence the possibility of it not always being php-fpm related. This does tend to further our suspicions that it may be occurring when a PHP process doesn't complete/terminate successfully or correctly. More updates to come as we continue troubleshooting.

websavers · Mar 28, 2020

The underlying dbus issue that was pinpointed as the cause of this bug (for most users) was fixed in Red Hat and CentOS with a dbus update months ago, yet the workaround cron script in the Plesk Support article you provided is still needed. Has there been any indication as to why this is the case? In other words, have Plesk devs discovered the Plesk-specific cause of this scope unit / abandoned session leak?

moswak · Aug 5, 2020

I'm beginning to suspect that this also has something to do with the kernel used and the virtualization platform.
With Virtuozzo 6 there was never a problem and the cron script was not necessary.
We migrated some servers to virtuozzo 7 and suddenly all servers have this bug and the cron deletes hundreds of sessions every hour.
All centos7/8 Server with Plesk Obsidian.

websavers · Aug 5, 2020

Vz7 around here too. I take it you were running CentOS 7 containers on the Virtuozzo 6 nodes (2.6.32 kernel) prior to migration to Virtuozzo 7?

It's quite possible that running CentOS 7 with systemd on kernel 2.6.32 simply handles systemd sessions differently (like a form of backwards compatible emulation) than with kernel 3.10.x where it's handled natively.

Or it could be that the Virtuozzo Linux 7 packages haven't incorporated the fixes from upstream.

moswak · Aug 5, 2020

websavers said:
I take it you were running CentOS 7 containers on the Virtuozzo 6 nodes (2.6.32 kernel) prior to migration to Virtuozzo 7?

Yes, the CTs with CentOS 7 ran on VZ 6 without systemd problems. The problems have only existed since the CTs were migrated to VZ 7.

websavers · Aug 24, 2020

A comment on an unrelated bug report (which seems to have been a glitch) mentions in passing that these versions supposedly contain the fixes for this systemd session leakage problem:

cat /etc/virtuozzo-release
Virtuozzo release 7.0.13 (136)
uname -a
Linux server.com 3.10.0-1062.7.1.vz7.130.12 x86_64 GNU/Linux
vzctl --version
vzctl v.7.0.209-1.vz7
rpm -qa | egrep "systemd|dbus"
systemd-219-67.vl7.2.6.x86_64
systemd-libs-219-67.vl7.2.6.x86_64
systemd-python-219-67.vl7.2.6.x86_64
systemd-sysv-219-67.vl7.2.6.x86_64
dbus-1.10.24-13.vl7.1.x86_64
dbus-glib-0.100-7.vl7.1.x86_64
dbus-python-1.1.1-9.vl7.1.x86_64

These versions were released on March 31st. The last time we encountered this problem was in March, so it's possible it's fixed now. Granted we don't have many CentOS 7 containers any longer that *don't* have the cron job set up to auto-repair this.

@moswak any chance the Virtuozzo/OpenVZ 7 node(s) you're still encountering this problem upon might still be running an older kernel and/or versions of systemd and dbus?

moswak · Aug 26, 2020

The versions used are "more up-to-date" than the ones you listed. The bug is still there. However, there are only a few servers where the cron cleans a lot of sessions. Most servers only clean a handful of sessions when it runs every 3-6 hours.

websavers · Sep 23, 2020

On the OpenVZ bug tracker (bugs.openvz.org), one of the Virtuozzo devs indicated the final fix for this will be included in the kernel released along with Virtuozzo 7.0 Update 15, which is expected to be released at the end of October.

Given this info, I'm guessing that the issue was indeed resolved in CentOS and RHEL earlier this year, but it took this long for the Virtuozzo devs to fully pull in all the changes necessary to fix it.

websavers · Mar 6, 2024

This issue was fixed with a Virtuozzo update, as indicated above. We were able to remove our workarounds for all CentOS 7 boxes.

Now a very similar issue has occurred with AlmaLinux 9 where there are thousands of abandoned sessions to the point they reach the session limit of 8192. The very same workaround above clears the abandoned sessions.

This time around all abandoned sessions are Plesk user sessions, which could be tied to PHP-FPM processes. They can be seen with: `systemctl | grep abandoned`
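To check which accounts the abandoned sessions belong to, a tally per user can be sketched as below. This assumes the scope description ends in "of user NAME" (newer systemd versions capitalize it as "of User NAME", so the match is case-insensitive); the function name is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `systemctl` unit listing on stdin and prints
# "COUNT USER" for abandoned scopes, highest count first.
# Assumes the description's last two words are "user NAME" / "User NAME".
abandoned_by_user() {
    awk '$4 == "abandoned" && tolower($(NF-1)) == "user" { count[$NF]++ }
         END { for (u in count) print count[u], u }' | sort -k1,1nr -k2,2
}
```

A possible invocation is `systemctl list-units --type=scope --no-legend | abandoned_by_user | head`.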

As the abandoned sessions build up, so too does load on the server, just like before, to the point that systemd-logind or init (or both) are eating up 100% CPU on one core. When a Plesk update occurs, processes fail to restart, presumably because the system has reached its session limit and cannot create a session in which to start the process.

With CentOS 7 these abandoned sessions were from all types of system services, however this time it appears to be limited to cgroups sessions created strictly by Plesk system user accounts. This makes it unclear if the underlying cause is kernel and/or dbus code like before; it seems like it might be the Plesk cgroups controller causing this. Perhaps related to this issue.

Peter Debik · Mar 6, 2024

I cannot confirm a high number of abandoned sessions on either Alma 8 or 9 (dedicated) production servers. On tests I only see a single one. So maybe it's an issue with Virtuozzo?

websavers · Mar 6, 2024

Indeed - it certainly was the last time. It appears to be connected to cgroups sessions clashing with the default dbus timeout. Yet it only occurs on busier servers. I think it's going like this:

1. There must be many processes generating short-running cgroups sessions (perhaps php-fpm ondemand is the most likely source of that)
2. The process hangs waiting for the dbus timeout
3. Meanwhile additional sessions are being created by other processes, each one contributing to a slower and slower response from logind while it tries to close all the sessions. Ultimately they all get stuck in the closing state as they build towards the max sessions limit 8192.

Less busy servers don't run into the problem as there are fewer processes sticking around until the dbus timeout, so they never reach a critical mass of closing sessions.
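A simple early warning for the build-up described above could compare the current session count against the limit. This is only a sketch: the function name and the 80% default threshold are assumptions for this example, and 8192 is the limit mentioned in this thread (I believe it corresponds to logind's `SessionsMax=` setting, but check your own configuration).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when COUNT has reached
# THRESHOLD_PCT percent of LIMIT.
# Usage: near_session_limit COUNT [LIMIT] [THRESHOLD_PCT]
near_session_limit() {
    local count="$1" limit="${2:-8192}" threshold_pct="${3:-80}"
    # Integer math: count/limit >= pct/100  <=>  count*100 >= limit*pct
    [ $(( count * 100 )) -ge $(( limit * threshold_pct )) ]
}
```

For instance, `near_session_limit "$(loginctl list-sessions --no-legend | wc -l)" && echo "session count is approaching the limit"` could be run from cron.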

Presumably the underlying issue is whatever is causing the processes to be stuck in the closing state in the first place. That part I haven't figured out yet - perhaps an OpenVZ kernel bug.

This wasn't occurring on CentOS 7 any longer with nearly identical Plesk config and domains (prior to migration to AlmaLinux 9) and it does not appear to occur on our AlmaLinux 8 servers.

websavers · Mar 7, 2024

@Peter Debik curious if anyone at Plesk knows what the circumstances are that lead to needing this dbus timeout adjustment. As in, do you know what environment that is expected to apply to? I'd definitely be curious to learn if the KB article was created from a report on another OpenVZ container, especially considering the underlying systemd bug that the article links to should have been fixed years ago (and certainly was for CentOS 8 back around 2021).
