
Resolved Login and systemctl problems

chrisgch

New Pleskian
Server operating system version
Almalinux 9.6
Plesk version and microupdate number
18.0.71 Update #2
Our server has been running Plesk for many years without problems. Currently we are on Almalinux 9.6, which was a fresh Almalinux 9 install (no upgrade from CentOS).
We are currently on Plesk Obsidian 18.0.71 with automatic updates and system updates, safe updates enabled.

Suddenly a few days ago the server started to have a lot of problems:
1. Login to Plesk takes more than a minute.
2. Login via SSH also takes a long time: the user name and password prompts are shown immediately, "Last login" appears after 55 seconds, and login completes after 1.5 minutes.
3. Connections via FTP/FTPS fail (no response from server).
4. Trying to submit changed firewall rules shows "Applying firewall configuration" for >1 day.
5. Checking for the task with command plesk db "SELECT id,type,status,finishTime FROM longtasks"
shows the task as not_started.
6. All systemctl (as user or root) calls on the command line fail with "Connection timed out", even "systemctl show". Therefore I can't restart the FTP server.

According to the "top" command, system load is not higher than usual (0.75).
The web server (Apache with NGINX) works fine and is very fast.

Any idea what the cause could be and how to fix it? I'm worried that systemd is broken and that a reboot would fail.
 
Thanks for your reply! Unfortunately I get "-- No entries --" running this command as root, even when I just use
journalctl -u systemd
I also tried
journalctl -r
to see all the last entries immediately after a failed systemctl call, but there was nothing in there except for the usual failed login attempts from hackers/scripts.

netstat indicates that the ftp server isn't listening on port 21, so that explains why ftps fails to connect. I can't start it because of the systemctl timeout.
I can change the firewall rules by clicking on "Preview this script", then putting the script in a .sh file and running it from the command line. However, this didn't fix the issues.
 
The command
ps aux | wc -l
returns 208, which doesn't seem higher than usual.
The server is a dedicated Core i7-6700 with 64GB RAM and two SSDs (RAID 1) running a forum and download server. Uptime is 414 days, so maybe I should really try a reboot.

For testing, I have taken the forum offline, which has reduced the load from 0.8 to 0.1, but the number of processes remained the same.

I was able to start proftpd manually by changing the ServerType from inetd to standalone. However, login still takes way over a minute, even when using a local ftp client (or telnet to port 21) on the same server. The delay occurs after entering the password.
 
Thanks for checking.
There are several possible root causes:
  • Broken systemd/dbus (corrupted state, deadlock).
  • Exhausted resources: load average, disk I/O (did you check this?), or memory pressure (did you check whether so much swapping takes place that all other processes slow down while waiting on it?).
  • Hanging DNS resolution (affecting SSH login, services waiting on reverse lookups, e.g. wrong nameservers configured in /etc/resolv.conf, IPv6/IPv4 resolution issue or maybe a route misconfigured so that it contains several IPs, some of which are not pointing to your server).
  • Firewall or networking misconfiguration causing systemd to hang when trying to communicate over dbus.

Step 1) Look for OOM kills, full disk, or CPU bottlenecks:
top -n 1
free -h
df -h
dmesg | tail -n 50

Step 2) Check dbus/systemd:
ps aux | grep systemd
ps aux | grep dbus
If systemctl hangs, try:
strace -p $(pidof systemd)
to see what systemd is waiting on.

Step 3) Check DNS:
SSH delay is often caused by reverse DNS lookup. Test:
getent hosts $(hostname -I | awk '{print $1}')
getent hosts <your-public-IP>
If this hangs, fix /etc/hosts so that server IP resolves to its hostname:
123.45.67.89 myserver.mydomain.com myserver
Also check local TCP connectivity:
ss -tlnp | grep -E ':(21|22|8443)[[:space:]]'
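
To put a number on the delay, the lookups above can be timed. This is only a minimal sketch, assuming GNU coreutils "timeout" is installed; time_lookup is a hypothetical helper name, and the 15-second limit is an arbitrary choice:

```shell
# time_lookup ADDRESS
# Looks up ADDRESS with getent (which uses the system's NSS configuration)
# and prints the exit code and the elapsed whole seconds.
# Hypothetical helper; the 15 s limit is an arbitrary assumption.
time_lookup() {
    start=$(date +%s)
    rc=0
    timeout 15 getent hosts "$1" >/dev/null 2>&1 || rc=$?
    end=$(date +%s)
    echo "$1 rc=$rc seconds=$((end - start))"
}
```

Call it as time_lookup <your-public-IP>. An rc of 124 means the lookup hit the time limit, and an rc of 2 means getent found no entry.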

Step 4) Try a direct dbus query:
dbus-send --system --dest=org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.DBus.Peer.Ping
If that also hangs, dbus or systemd is broken.
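
Note that without --print-reply, dbus-send does not wait for an answer, so an immediate return proves little. Below is a rough sketch of a wrapper that runs any check with a time limit and reports whether it answered, hung, or failed. The name probe is made up for illustration, and it assumes GNU coreutils "timeout" (exit code 124 on timeout):

```shell
# probe SECONDS COMMAND [ARGS...]
# Runs COMMAND with a time limit and prints OK, HANG, or FAIL(code).
# "probe" is a hypothetical helper, not part of systemd, dbus, or Plesk.
probe() {
    limit=$1
    shift
    rc=0
    timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
    case $rc in
        0)   echo "OK: $*" ;;
        124) echo "HANG: $* (no answer within ${limit}s)" ;;
        *)   echo "FAIL($rc): $*" ;;
    esac
}

# Example calls on the affected server (left commented out so that the
# block only defines the helper):
# probe 10 systemctl show --property=Version
# probe 10 dbus-send --system --print-reply --dest=org.freedesktop.systemd1 \
#     /org/freedesktop/systemd1 org.freedesktop.DBus.Peer.Ping
```

If the dbus-send call reports OK but the systemctl call reports HANG, the bus itself still answers and the problem is more likely inside systemd (PID 1).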

Step 5) If systemd is unrecoverable:
If systemctl/dbus/systemd is deadlocked, most recovery attempts from SSH won’t work. You may need to reboot the server from your hosting provider’s control panel (hard reboot). Boot into rescue mode if reboot doesn’t fix it, then check logs:
journalctl -xe --no-pager
If you go for a reboot, immediately after the reboot check systemd information, to find out more on what might have caused issues:
systemctl --failed
journalctl -p 3 -xb
 
Thanks for the instructions!

Step 1)
Both memory (RAM) and disk space are plenty, no OOM kills. Only 4GB of RAM out of 64GB are used, 47GB cache (large forum database). Disk is only half full.
dmesg shows this error:
systemd-journald[<redacted>]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected

Step 2)
systemd and dbus are running. systemd seems to be hanging. strace -p $(pidof systemd) returns:
strace: Process 1 attached
waitid(P_ALL, 0,
and then it hangs (Ctrl+C after 10 minutes).

Step 3)
getent doesn't hang and returns valid replies. nslookup also returns a result immediately.
Port 21 isn't listening as described above.

Step 4)
No idea what this is supposed to send to systemd, but it returns immediately without any reply or error.

Step 5)
Not tried yet. I will try it during the weekend, in case the server doesn't come back up and needs to be reinstalled. That would be a real pain because our hoster no longer provides Linux images with pre-installed Plesk.
 
Sounds like a Hetzner device, as they ditched Plesk. No worries about a Plesk re-installation. You absolutely do not need an image with pre-installed Plesk, because the Plesk installation routine can easily be run from the console on a minimal operating system installation. You'll have this up in minutes. The real pain is the restore of the full backup ... That can take hours.

Does the server have hardware RAID? If so, there might be an issue with the RAID controller, but you'd see this in the /var/log/messages output or the RAID controller's log. You could also run edac-util to check whether there is faulty RAM. If the number of corrected errors on the RAM is high (e.g. tens of thousands of error corrections), replacing the RAM module may be necessary.
 
Yes, it's a Hetzner device, it's unfortunate that they ditched Plesk. Since they don't offer paid support, they should at least offer an easy way to maintain the server. When switching from CentOS to Almalinux, I didn't restore from the Plesk backup (although I have one off site). I just restored the settings, and then copied back the files and re-imported the MySQL databases from dumps.

The server uses software raid only. I have just performed a check with /usr/share/mdadm/mdcheck and observed the progress via cat /proc/mdstat. It ended without errors:
Aug 29 14:24:41 host.ghisler.ch kernel: md: data-check of RAID array md0
Aug 29 14:24:41 host.ghisler.ch kernel: md: delaying data-check of md2 until md0 has finished (they share one or more physical units)
Aug 29 14:24:41 host.ghisler.ch kernel: md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
Aug 29 14:27:29 host.ghisler.ch kernel: md: md0: data-check done.
Aug 29 14:45:02 kernel: md: md2: data-check done.
Aug 29 14:45:02 host.ghisler.ch kernel: md: data-check of RAID array md1
Aug 29 14:45:07 host.ghisler.ch kernel: md: md1: data-check done.
The RAM doesn't have ECC, but we have used that machine for many years without problems.
 
If the RAM doesn't have ECC, the only way to make sure it is working correctly is to run a tool like Memtest86+ (open source) or MemTest86 (PassMark's free version). Most Linux distros (Debian/Ubuntu, RHEL, etc.) include memtest86+ as a GRUB option, so check your boot menu. Otherwise, download the ISO/USB image from the Memtest86+ website. The drawback is that you'll need to run four passes, which can take hours. If you cannot run this from the emergency boot shell, you can install "memtester" on the production system and test the RAM that is not occupied by other programs. Another option is to run stress-ng and let it test the RAM for an hour. It might crash the server, though (which is how you know the RAM has a problem). None of these options is really comfortable.
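
For memtester, the size to test has to be chosen by hand. One possible way to pick it is sketched below, under the assumption that testing half of what the kernel reports as MemAvailable leaves enough headroom for the running services. The helper name suggest_test_mb and the one-half factor are my own choices, not a memtester recommendation:

```shell
# suggest_test_mb [MEMINFO_FILE]
# Prints half of MemAvailable in whole MB. Reads /proc/meminfo by default;
# the optional file argument exists only so the helper can be tried on a copy.
# Hypothetical helper; the factor 1/2 is an assumption.
suggest_test_mb() {
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 / 2 }' "${1:-/proc/meminfo}"
}

# Example call on the server (run as root, one loop; commented out here):
# memtester "$(suggest_test_mb)M" 1
```

Keep in mind that MemAvailable counts reclaimable page cache, so a large test will push the database cache out of RAM while it runs.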
 
Does RAM really go bad after many years of service, without any kind of problems except for systemd? This doesn't really make sense to me...
 
I can say for sure that RAM modules go bad. Some only have rare glitches, but some might produce hundreds of thousands of errors. With ECC you will not notice them at first (unless the problems are really big). If you do not have ECC, you will sooner or later see issues with some services or with read/write operations.

I do not know whether RAM is the issue in your case, but most other causes seem to have been ruled out already, so what's left is an issue with disk access or RAM integrity. As software RAID is used, an issue with disk access is unlikely in this case, which leaves RAM as the next most likely trouble source.

There could also be an issue with the CPU itself (e.g. the kernel might shut down 10 out of 12 cpu cores, this will definitely cause a slow system). However, you'd see clearly visible temperature warnings or CPU unit "on"/"off" switch statements in the /var/log/messages log for that. As you did not report anything suspicious like it, RAM again is the only remaining hardware trouble source.

As always, these are only educated guesses.
 
Fortunately I was able to reboot the server with
systemctl --force --force reboot
and now everything seems to be back to normal. Login is as fast as before, and systemctl works too.
Since this is a production system, I will use memtester to test the RAM. A short test with 1GB didn't reveal any errors yet.

Thanks to everyone who helped!
 