
Issue 504 bad gateway hard to debug

bruno vianna

New Pleskian
Hi

I've been having an issue since I updated to version 18.0.27 on Debian. Every morning at 6:30, nginx starts to hang with a 504 error on all sites. Accessing Plesk's panel on its own port works fine, though. I tried restarting nginx and apache, to no avail. The only solution I've found is to reboot the server. Then it works fine until the next morning.

There seems to be a spike in CPU usage at this time, including the mysql daemon, but nothing out of the ordinary. I checked the access IPs for DoS attacks and there is nothing suspicious. It is a very low-traffic site. It seems like Plesk runs a backup at that time, but the scheduled backup is at 2am, not 6:30am. Also, I tried a manual backup and the system doesn't hang.

Any idea what could be going on? I can post whatever logs might be useful to get this solved.

Thanks
B
 
Try to run system daily cron tasks one by one to find a reason:

# ll /etc/cron.daily/
total 36
-rwxr-xr-x 1 root root 211 Feb 19 2013 00webalizer
-rwxr-xr-x 1 root root 282 Jun 16 00:08 50plesk-daily
-rwxr-xr-x 1 root root 448 Jun 10 22:22 60sa-update
-rw-r--r-- 1 root root 152 Apr 24 10:19 awstats
-rwxr-xr-x 1 root root 57 Jun 15 16:47 dmarc-report
-rwxr-xr-x 1 root root 237 Jun 10 18:47 drweb-update
-rwx--x--x 1 root root 219 Apr 16 2019 logrotate
-rw-r--r-- 1 root root 618 Apr 16 2019 man-db.cron
-rw------- 1 root root 208 Apr 16 2019 mlocate
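
To make the "one by one" test repeatable, something like the following could help. It is only a rough sketch: the function names and the probe URL are made up for illustration, and it assumes curl is installed. It runs each executable file in the directory and checks after each one whether the site still answers.

```shell
#!/bin/sh
# Hypothetical helper, not part of Plesk: run every executable file in a
# cron directory one at a time and probe a URL after each job.

probe_site() {
    # Succeeds if the URL answers within 15 seconds with a status below 400.
    curl -fsS -o /dev/null --max-time 15 "$1"
}

run_jobs_one_by_one() {
    cron_dir=$1
    probe_url=$2
    for job in "$cron_dir"/*; do
        # Skip anything that is not an executable regular file.
        [ -f "$job" ] && [ -x "$job" ] || continue
        echo "running: $job"
        "$job" || echo "exit status $? from $job"
        if ! probe_site "$probe_url"; then
            echo "site stopped responding after: $job"
            return 1
        fi
    done
    echo "all jobs ran; site still responding"
}
```

Run as root, for example `run_jobs_one_by_one /etc/cron.daily http://localhost/` (the URL is a placeholder; use one of the affected sites). Note that jobs started by hand inherit your shell's environment, so a job can behave differently than it does under cron.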
 
Thanks. I ran all the tasks one by one and it did not hang. The crontab looks quite different from yours, though:

47 23 * * * /usr/sbin/ntpdate -b -s 2.pool.ntp.org
0 1 * * 1 /opt/psa/libexec/modules/watchdog/cp/secur-check
0 1 * * * /opt/psa/libexec/modules/watchdog/cp/send-report daily
10 1 * * * /opt/psa/libexec/modules/watchdog/cp/clean-sysstats
15 1 * * * /opt/psa/libexec/modules/watchdog/cp/pack-sysstats day
15 1 * * 1 /opt/psa/libexec/modules/watchdog/cp/pack-sysstats week
15 1 1 * * /opt/psa/libexec/modules/watchdog/cp/pack-sysstats month
15 1 1 * * /opt/psa/libexec/modules/watchdog/cp/pack-sysstats year
20 1 * * * /opt/psa/libexec/modules/watchdog/cp/clean-events
0 3 * * 7 /opt/psa/libexec/modules/watchdog/cp/clean-reports

Edit: I realize now this is a different cron - I'm running the ones from /etc/cron.daily now
 
What I showed you was the contents of the /etc/cron.daily/ directory on my CentOS 7 test Plesk server, not scheduled tasks in crontab format. Those are two different things.
 
OK, so I also ran all the scripts in /etc/cron.daily and the site did not hang. But it would make sense for it to be something in there, since that directory is scheduled for 6:25, right before the daily crash.
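
For anyone wanting to confirm when /etc/cron.daily is actually triggered, here is a small sketch. It assumes the stock Debian layout, where a line in /etc/crontab calls run-parts on the directory; the function name is made up.

```shell
#!/bin/sh
# Print the HH:MM at which /etc/cron.daily is triggered, read from a system
# crontab file (default /etc/crontab). System crontab lines have the form:
# minute hour day-of-month month day-of-week user command
daily_cron_time() {
    crontab_file=${1:-/etc/crontab}
    # Ignore comment lines, then print hour and minute of the matching entry.
    awk '$1 !~ /^#/ && /cron\.daily/ { printf "%02d:%02d\n", $2, $1 }' "$crontab_file"
}
```

On Debian, `run-parts --test /etc/cron.daily` lists the scripts cron would really run without executing them, which may differ from what a plain directory listing suggests. If anacron is installed, the /etc/crontab line usually defers to it, so the time printed here may not be the time the jobs start.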
 
Have you considered looking into the Nginx log to see why Nginx is not starting? It is quite possible that Nginx needs to reload its configuration or restart gracefully for some reason, for example after log rotation. If it then fails to restart or crashes while reloading the configuration, the reason will be mentioned in the Nginx log file. Please post it here for further assistance.
 
Thanks for your help. I really can't find anything wrong in the nginx access and error logs at the time it crashes. I'm pasting them below. The wget requests are my scripts that let me know when the system crashes. There is a series of errors at 7:02, but that is when I restart the system.



Access:
x.x.x.x - - [22/Jun/2020:05:40:01 +0200] "GET / HTTP/1.1" 301 162 "-" "Wget/1.20.3 (linux-gnu)"
x.x.x.x - - [22/Jun/2020:05:40:01 +0200] "GET / HTTP/1.1" 301 162 "-" "Wget/1.19.4 (linux-gnu)"
x.x.x.x - - [22/Jun/2020:05:49:42 +0200] "GET /wp-content/themes/CP20/doc.php HTTP/1.1" 301 162 "http://site.ru" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.99 Safari/533.4"
x.x.x.x - - [22/Jun/2020:05:49:44 +0200] "GET /wp-content/themes/CP20/doc.php HTTP/1.1" 301 162 "http://site.ru" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.99 Safari/533.4"
x.x.x.x - - [22/Jun/2020:05:50:01 +0200] "GET / HTTP/1.1" 301 162 "-" "Wget/1.20.3 (linux-gnu)"
x.x.x.x - - [22/Jun/2020:05:52:17 +0200] "GET /wp-content/uploads/2020/05/doc.php HTTP/1.1" 301 162 "http://site.ru" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.99 Safari/533.4"
x.x.x.x - - [22/Jun/2020:06:00:00 +0200] "GET / HTTP/1.1" 301 162 "-" "Wget/1.19.4 (linux-gnu)"
x.x.x.x - - [22/Jun/2020:06:00:01 +0200] "GET / HTTP/1.1" 301 162 "-" "Wget/1.20.3 (linux-gnu)"
x.x.x.x - - [22/Jun/2020:06:09:05 +0200] "GET / HTTP/1.1" 200 1471 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0"
x.x.x.x - - [22/Jun/2020:06:10:01 +0200] "GET / HTTP/1.1" 301 162 "-" "Wget/1.20.3 (linux-gnu)"
x.x.x.x - - [22/Jun/2020:06:20:01 +0200] "GET / HTTP/1.1" 301 162 "-" "Wget/1.19.4 (linux-gnu)"
x.x.x.x - - [22/Jun/2020:06:20:01 +0200] "GET / HTTP/1.1" 301 162 "-" "Wget/1.20.3 (linux-gnu)"
x.x.x.x - - [22/Jun/2020:06:23:02 +0200] "GET / HTTP/1.1" 301 162 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.92 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
x.x.x.x - - [22/Jun/2020:06:30:01 +0200] "GET / HTTP/1.1" 301 162 "-" "Wget/1.20.3 (linux-gnu)"

Error:
2020/06/22 07:02:28 [alert] 669#0: *4847 open socket #31 left in connection 8
2020/06/22 07:02:28 [alert] 669#0: *4843 open socket #26 left in connection 10
2020/06/22 07:02:28 [alert] 669#0: *4845 open socket #29 left in connection 15
2020/06/22 07:02:28 [alert] 669#0: *4846 open socket #30 left in connection 17
2020/06/22 07:02:28 [alert] 669#0: *4781 open socket #23 left in connection 20
2020/06/22 07:02:28 [alert] 669#0: *4829 open socket #25 left in connection 22
2020/06/22 07:02:28 [alert] 669#0: *4832 open socket #28 left in connection 23
2020/06/22 07:02:28 [alert] 669#0: *4842 open socket #24 left in connection 24
2020/06/22 07:02:28 [alert] 669#0: *4841 open socket #3 left in connection 25
2020/06/22 07:02:28 [alert] 669#0: *4844 open socket #27 left in connection 29
2020/06/22 07:02:28 [alert] 669#0: aborting
 
When it crashes, have you tried to run
# service nginx restart
to restart it? I mean, instead of rebooting the whole system? Normally, running a restart command should produce log output that points to the cause of the issue.

If a restart as shown above works, then as a dirty "hot fix" you could create a cron job that runs the command automatically at, for example, 6:27 am until the true cause can be found and fixed.
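
A minimal sketch of such a stopgap, assuming a Debian-style /etc/cron.d directory and that `service` lives in /usr/sbin. The file name and function name are made up.

```shell
#!/bin/sh
# Write a temporary cron.d entry that restarts nginx at 6:27 every morning.
# Files in /etc/cron.d use the system crontab format, which includes a user
# field (root here) between the schedule and the command.
write_restart_cron() {
    target=${1:-/etc/cron.d/nginx-hotfix}
    printf '%s\n' '27 6 * * * root /usr/sbin/service nginx restart' > "$target"
}
```

Calling `write_restart_cron` as root creates the entry; deleting the file removes it again once the real cause is known.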
 
Yes. I tried restarting nginx and apache, both separately and at the same time, and it doesn't work.

What I have running now is a cron job that reboots the system every day. But it doesn't seem like a proper solution.
In fact, for some reason the cron job set up from within Plesk (whether through the control panel or the CLI) doesn't work, so I have another machine ssh'ing in and rebooting it.

Thanks for the help anyway!
 
And there is no log excerpt, failure or warning message when you try to start Nginx manually in that error situation? There should be one, and it almost certainly gives the reason why it cannot restart.
 
That should not be possible. I remember a case, however, where the local IP addresses were not whitelisted in Fail2Ban, which led to this issue. Everything is actually working, but Fail2Ban is blocking the server's own IP. When Nginx forwards a request to Apache, it connects from the local machine's IP. If that is blocked by the firewall, the connection fails, and Nginx shows the 504 gateway error because it does not get a response from Apache. So you might want to check in the Fail2Ban configuration whether your local IP(s) are whitelisted.
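
A quick way to check a jail file for a given address might look like this. It is only a sketch: the function name is made up, the default path /etc/fail2ban/jail.local is an assumption (Plesk may keep its settings elsewhere), and it only does exact matches on single-line `ignoreip` entries, so an address covered by a CIDR range such as 127.0.0.1/8 is not detected.

```shell
#!/bin/sh
# Succeeds if the given IP appears literally in an uncommented "ignoreip ="
# line of a Fail2Ban jail file. CIDR ranges and continuation lines are
# not interpreted.
ip_is_ignored() {
    ip=$1
    jail_file=${2:-/etc/fail2ban/jail.local}
    # Keep the value after "=", put each entry on its own line, then look
    # for an exact, fixed-string match of the whole entry.
    grep -E '^[[:space:]]*ignoreip[[:space:]]*=' "$jail_file" 2>/dev/null \
        | sed 's/^[^=]*=//' \
        | tr ', ' '\n\n' \
        | grep -Fxq "$ip"
}
```

For example, `ip_is_ignored 203.0.113.5 && echo whitelisted` (the address is a documentation placeholder). To see what is banned right now, `fail2ban-client status` lists the jails and `fail2ban-client status <jail>` shows the banned IPs for one of them.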
 
Yes, the fail2ban is one of the first suggestions I've found in forums. The machine ip is listed as trusted. And anyway, why would it only ban at 6:30 am? The whole day it works properly and stops serving out of nowhere.
 
Could it be that a backup job is running at that time, and that it is configured to suspend the websites while the backup runs?
 
I do think the blame is on one of the jobs that run at 6:30, or a combination of them. I tried running all the jobs individually, but none of them caused the error. I just changed the time to 5:30 to see if the theory is right, let's see what happens tomorrow.

And if it was just the backup job overload, shouldn't it go back to normal afterwards? It keeps giving me a 504 until I reboot.
 
Just to follow up on this, indeed, when I changed the scheduled crons to 5:30, the bad gateway also started at that time. I'll look deeper into the scripts.
 
And there is no log excerpt, failure or warning message when you try to start Nginx manually in that error situation? There should be one, and it almost certainly gives the reason why it cannot restart.

I think that this post of mine was a bit misleading, sorry for that. Did you also try to restart Apache manually and what error messages do you find there? Because the 504 is not caused by Nginx, but by Apache not responding.
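
One way to test Apache on its own is to request a page from it directly, bypassing Nginx. The sketch below assumes curl is installed; the function names are made up, and the port is an assumption too (on Plesk with Nginx as a proxy, Apache commonly listens on 7080 for HTTP, but check your own configuration).

```shell
#!/bin/sh
# Print the HTTP status code returned by a backend URL. curl prints 000
# when it gets no response at all (connection refused, blocked, or timeout).
backend_status() {
    curl -s -o /dev/null --max-time "${2:-10}" -w '%{http_code}' "$1"
}

# Turn a status code into a short human-readable hint.
explain_status() {
    case $1 in
        000) echo "no response: backend down, blocked, or timed out" ;;
        5??) echo "backend answered with a server error ($1)" ;;
        *)   echo "backend answered ($1)" ;;
    esac
}
```

During the outage, something like `explain_status "$(backend_status http://127.0.0.1:7080/)"` would show whether Apache itself has stopped answering, which is what a 504 from Nginx suggests.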
 