Issue Apache hangs with no load every 2-4 days

nkomarov · Jun 25, 2021

Hi there,
I have a VPS at the German provider Strato, which rolled out Ubuntu with Plesk pre-installed. It currently runs Plesk Obsidian Web Admin Edition, version 18.0.36.

Every 2-4 days, early in the morning (for example, at 3:34 or 4:10 AM), Apache stops responding to external requests. The site is a trivial installation of WordPress + WooCommerce with some plugins, nothing special. There is no load whatsoever because we're still preparing the shop and only authenticated users get past the "Under Construction" page, but it still hangs. I also see a strange memory usage graph. There is nothing special in the Apache error log. An Apache restart helps for 2-4 days, then it happens again. I checked the cron tasks, but nothing is scheduled in this timeframe.

The PHP version is 7.3.28 (7.4.20 is available), run as an FPM application.

Could I enable some additional logging? I could also try a daily restart of Apache, but I'd rather find the true cause of this issue.

Bitpalast · Jun 25, 2021

Could this be an automatic WordPress update routine?

nkomarov · Jun 25, 2021

Thanks for the idea! That was also one of my primary suspicions, so the first thing I did was move it closer to 1:00 AM. By the time this happens, wp-cron.php has already finished hours ago. But it still happens.

weltonw · Jun 25, 2021

Nothing in any logs? Is there any Apache restart/reload? Are you hitting request worker limits?

pleskpanel · Jun 26, 2021

There are three places to check:

The website's error log in Plesk
Enable the WordPress error log file
The error log for your PHP handler

For a site that isn't yet launched, this sounds more like an uncontained code leak; however, it's too soon to tell.

Be sure to check that Apache is set to restart gracefully as well. While the issuance of SSL certificates should not be daily (unless there are failures each time) and this setting likely has nothing to do with your issue, it's still good practice.

nkomarov · Jun 28, 2021

Yes, the checkbox of graceful restart has been on since day 1.

When the site was unreachable, I went to the /admin/services/list page of Plesk and could see that the service was running, so I thought it was "hanging". I now feel it was wrongly shown as running while it had actually crashed. This seems like an issue in Plesk: it only shows a service as "stopped" if it has been stopped from this page; it doesn't check whether services are actually running when the page loads.

I've installed the Watchdog extension and found out that these crashes happen more often than I had noticed. Apache has also been going down and up during the day, but I could only see that once Watchdog was installed. I'm not sure whether restarts happened before as well; I only know that there were some early-morning crashes without a restart.

In the access/error log I see:

2021-06-27 11:02:01  Access  200  GET /wp-cron.php?doing_wp_cron HTTP/1.1

Soon afterwards I received by email: "Apache is down. The problem was discovered on Jun 28, 2021 11:04 AM."

Then it hangs again, but without a corresponding wp-cron call: "The problem was discovered on Jun 28, 2021 11:26 AM."

Then there is a wp-cron call without a corresponding crash...

2021-06-27 12:02:01  Access  85.214.71.120  200  GET /wp-cron.php?doing_wp_cron HTTP/1.1

So there is no strict correlation between these. I had a define('DISABLE_WP_CRON', true); line in config.php, but it didn't work... Now I've changed it to define('DISABLE_WP_CRON', 'true'); and hope it will finally disable wp-cron.
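
For reference, in PHP both true and the string 'true' are truthy, so that change on its own would not normally alter the outcome. A more common reason for DISABLE_WP_CRON to be ignored is that the define sits below the line that loads wp-settings.php, by which point WordPress core has already loaded. The sketch below checks for that. It assumes "config.php" means the site's wp-config.php, and the function name check_wp_cron_define is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: report whether DISABLE_WP_CRON appears before the line
# that loads wp-settings.php in the given wp-config.php.
# Note: a commented-out define would also match; this is only a rough check.
check_wp_cron_define() {
    config_file=$1
    # Line number of the first mention of DISABLE_WP_CRON (empty if none).
    define_line=$(awk '/DISABLE_WP_CRON/ { print NR; exit }' "$config_file")
    # Line number of the first mention of wp-settings.php (empty if none).
    settings_line=$(awk '/wp-settings\.php/ { print NR; exit }' "$config_file")

    if [ -z "$define_line" ]; then
        echo "missing"
    elif [ -n "$settings_line" ] && [ "$define_line" -gt "$settings_line" ]; then
        echo "too-late"
    else
        echo "ok"
    fi
}
```

Usage would look like check_wp_cron_define /path/to/wp-config.php, where the path is whatever the site's document root is. A result of "too-late" means the define should be moved above the wp-settings.php line.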

Typing "journalctl -u apache2.service --since today --no-pager" returned (by the way, is there a way to see these things from Plesk itself?):

Jun 28 03:26:32 XXX.stratoserver.net systemd[1]: apache2.service: Control process exited, code=exited status=2
Jun 28 03:26:32 XXX.stratoserver.net systemd[1]: Reload failed for The Apache HTTP Server.
Jun 28 03:26:45 XXX.stratoserver.net systemd[1]: Stopping The Apache HTTP Server...
Jun 28 03:26:45 XXX.stratoserver.net apachectl[32077]: /usr/sbin/apachectl: 98: /usr/sbin/apachectl: Cannot fork
Jun 28 03:26:45 XXX.stratoserver.net systemd[1]: apache2.service: Control process exited, code=exited status=2
Jun 28 03:26:56 XXX.stratoserver.net systemd[1]: apache2.service: Failed with result 'exit-code'.
Jun 28 03:26:56 XXX.stratoserver.net systemd[1]: Stopped The Apache HTTP Server.
Jun 28 03:26:56 XXX.stratoserver.net systemd[1]: Starting The Apache HTTP Server...
Jun 28 03:26:57 XXX.stratoserver.net systemd[1]: Started The Apache HTTP Server.
Jun 28 03:27:00 XXX.stratoserver.net systemd[1]: Reloading The Apache HTTP Server.
Jun 28 03:27:01 XXX.stratoserver.net systemd[1]: Reloaded The Apache HTTP Server.
Jun 28 11:04:48 XXX.stratoserver.net systemd[1]: Stopping The Apache HTTP Server...
Jun 28 11:04:48 XXX.stratoserver.net systemd[1]: apache2.service: Control process exited, code=exited status=2
Jun 28 11:05:59 XXX.stratoserver.net systemd[1]: apache2.service: Failed with result 'exit-code'.
Jun 28 11:05:59 XXX.stratoserver.net systemd[1]: Stopped The Apache HTTP Server.
Jun 28 11:05:59 XXX.stratoserver.net systemd[1]: Starting The Apache HTTP Server...
Jun 28 11:06:00 XXX.stratoserver.net systemd[1]: Started The Apache HTTP Server.
Jun 28 11:26:13 XXX.stratoserver.net systemd[1]: Stopping The Apache HTTP Server...
Jun 28 11:26:13 XXX.stratoserver.net systemd[1]: apache2.service: Control process exited, code=exited status=2
Jun 28 11:26:23 XXX.stratoserver.net systemd[1]: apache2.service: Failed with result 'exit-code'.
Jun 28 11:26:23 XXX.stratoserver.net systemd[1]: Stopped The Apache HTTP Server.
Jun 28 11:26:24 XXX.stratoserver.net systemd[1]: Starting The Apache HTTP Server...
Jun 28 11:26:24 XXX.stratoserver.net systemd[1]: Started The Apache HTTP Server.
Jun 28 12:13:22 XXX.stratoserver.net systemd[1]: Stopping The Apache HTTP Server...
Jun 28 12:13:22 XXX.stratoserver.net apachectl[17974]: /usr/sbin/apachectl: 98: /usr/sbin/apachectl: Cannot fork
Jun 28 12:13:22 XXX.stratoserver.net systemd[1]: apache2.service: Control process exited, code=exited status=2
Jun 28 12:13:33 XXX.stratoserver.net systemd[1]: apache2.service: Failed with result 'exit-code'.
Jun 28 12:13:33 XXX.stratoserver.net systemd[1]: Stopped The Apache HTTP Server.
Jun 28 12:13:33 XXX.stratoserver.net systemd[1]: Starting The Apache HTTP Server...
Jun 28 12:13:33 XXX.stratoserver.net systemd[1]: Started The Apache HTTP Server.

Then I tried "tail -n 500 /var/log/apache2/error.log | grep err", which returned:

(dozens of identical messages like the three below)
[Mon Jun 28 11:05:18.795193 2021] [mpm_event:error] [pid 32146:tid 140438082923456] (11)Resource temporarily unavailable: AH00481: fork: Unable to fork new process
[Mon Jun 28 11:05:28.810980 2021] [mpm_event:error] [pid 32146:tid 140438082923456] (11)Resource temporarily unavailable: AH00481: fork: Unable to fork new process
[Mon Jun 28 11:05:38.829117 2021] [mpm_event:error] [pid 32146:tid 140438082923456] (11)Resource temporarily unavailable: AH00481: fork: Unable to fork new process
[Mon Jun 28 11:05:58.192783 2021] [core:error] [pid 32146:tid 140438082923456] AH00046: child process 900 still did not exit, sending a SIGKILL
[Mon Jun 28 11:05:58.214747 2021] [core:error] [pid 32146:tid 140438082923456] AH00046: child process 6388 still did not exit, sending a SIGKILL
[Mon Jun 28 11:10:48.391960 2021] [pagespeed:error] [pid 15530:tid 140681175623424] [mod_pagespeed 1.13.35.2-0 @15530] [0628/111048:ERROR:worker.cc(127)] Unable to start worker thread
[Mon Jun 28 11:24:14.819923 2021] [pagespeed:error] [pid 15531:tid 140682769450752] [mod_pagespeed 1.13.35.2-0 @15531] [0628/112414:ERROR:worker.cc(127)] Unable to start worker thread
[Mon Jun 28 11:26:22.525757 2021] [core:error] [pid 15526:tid 140683605941184] AH00046: child process 15530 still did not exit, sending a SIGKILL
[Mon Jun 28 12:08:57.393869 2021] [pagespeed:error] [pid 16067:tid 140594638726912] [mod_pagespeed 1.13.35.2-0 @16067] [0628/120857:ERROR:worker.cc(127)] Unable to start worker thread
[Mon Jun 28 12:08:57.400368 2021] [pagespeed:error] [pid 16067:tid 140594638726912] [mod_pagespeed 1.13.35.2-0 @16067] [0628/120857:ERROR:worker.cc(127)] Unable to start worker thread
[Mon Jun 28 12:13:31.988520 2021] [core:error] [pid 16062:tid 140595272010688] AH00046: child process 16067 still did not exit, sending a SIGKILL

Why do I see nothing like this in the Apache Logs in Plesk?

So I've disabled Apache module pagespeed and restarted Apache2. Will see what happens.
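
To line the fork failures up against the Watchdog emails, it may help to count the AH00481 lines per minute instead of reading them one by one. This is only a sketch: the function name count_fork_errors is made up, and it assumes the error log keeps the timestamp layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: read an Apache error log on stdin and print how many
# AH00481 ("Unable to fork new process") lines occurred in each minute.
count_fork_errors() {
    awk '/AH00481/ {
        # Fields look like: [Mon Jun 28 11:05:18.795193 2021] ...
        key = $2 " " $3 " " substr($4, 1, 5)   # e.g. "Jun 28 11:05"
        count[key]++
    }
    END { for (k in count) print k, count[k] }' | sort
}

# Only run against the real log if it exists and is readable.
if [ -r /var/log/apache2/error.log ]; then
    count_fork_errors < /var/log/apache2/error.log
fi
```

Each output line is a minute followed by the number of failed forks in it, which can then be compared with the journalctl stop/start times.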

Monty · Jun 28, 2021

fork: Unable to fork new process

=> Your VPS is running out of resources, most probably related to kmemsize limits set by Strato. Check your /var/log/messages (or /var/log/syslog, /var/log/kern.log) for any errors related to resources.

I think you'll have to contact Strato support about this. It looks like your VPS needs more resources....
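
If the VPS is a Virtuozzo/OpenVZ container (an assumption; the kmemsize limit mentioned above belongs to that family), the per-container limits and their failure counters are usually exposed in /proc/user_beancounters, readable as root. A rough sketch that lists only the resources whose failcnt is above zero follows; the function name failed_beancounters is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read /proc/user_beancounters-style text on stdin and
# print "<resource> <failcnt>" for every resource with a non-zero failcnt.
# Assumes the usual layout: resource, held, maxheld, barrier, limit, failcnt,
# with an extra "<uid>:" column on the first resource line.
failed_beancounters() {
    awk 'NF >= 6 && $NF ~ /^[0-9]+$/ && $NF + 0 > 0 { print $(NF - 5), $NF }'
}

# The file only exists inside such containers, so guard the call.
if [ -r /proc/user_beancounters ]; then
    failed_beancounters < /proc/user_beancounters
fi
```

The counters are cumulative, so comparing the output before and after an outage shows which limit was hit. A rising failcnt on numproc (the process and thread count) rather than on a memory resource would fit "Cannot fork" errors on a machine with plenty of free RAM.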

nkomarov · Jun 28, 2021

Monty said:
fork: Unable to fork new process

=> Your VPS is running out of resources, most probably related to kmemsize limits set by Strato. Check your /var/log/messages (or /var/log/syslog, /var/log/kern.log) for any errors related to resources.

I think you'll have to contact Strato support about this. It looks like your VPS needs more resources....

I checked htop and it shows that only 1G of 8G RAM is used. I've changed the PHP memory limit from 256M to 512M (usage in htop rose slightly) and hope it will also help.

I'll check these files tomorrow, thanks.

EDIT: after 6 hours I don't see any further errors like those shown above. Maybe the 256M limit in the PHP settings was not high enough.

Ehud · Jun 28, 2021

Plesk Version 18.0.36 MIGHT have a BUG, POSSIBLY related to Plesk changing some memory allocation configuration, without user consent.

Also have a look at:

Question - Plesk Obsidian - 18.0.36 and nginx worker connection issues

Hi Guys, I recently took a server from Plesk Obsidian 18.0.35 to .36, everything appeared to be fine but on a daily basis the sites go offline for 10/15 minutes at a time and return a 500 internal nginx error. Looking at the nginx logs I can see the below entry continually in the error_log...

talk.plesk.com

weltonw · Jun 28, 2021

nkomarov said:
I checked htop and it shows that only 1G of 8G RAM is used. I've changed the PHP memory limit from 256M to 512M (usage in htop rose slightly) and hope it will also help.

I'll check these files tomorrow, thanks.

EDIT: after 6 hours I don't see any further errors like those shown above. Maybe the 256M limit in the PHP settings was not high enough.

You're not running mod_php/cgi are you?

Edit: Ignore my reply. I re-read your initial thread

Edit2: Have you checked your system logs? There are lots of things that can prevent forking, e.g. kernel/security limits. I'd also set up some monitoring to see just how many processes are being forked, and why that's happening.

It's quite odd that raising the PHP mem limit would help. Do keep us updated.

Edit3: In your initial graph, you show multiple spikes in memory usage. Do these coincide with Apache restarts?
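
As a starting point for that kind of monitoring, the sketch below totals threads per command name, since fork failures with free RAM often come down to a process or thread count limit. It assumes a procps-style ps that accepts -eo nlwp=,comm=; the function name threads_per_command is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "<thread count> <command name>" lines on stdin
# (as printed by: ps -eo nlwp=,comm=) and print total threads per command,
# largest first. Command names containing spaces are cut at the first space.
threads_per_command() {
    awk '{ total[$2] += $1 }
         END { for (c in total) print total[c], c }' | sort -rn
}

# Run against the live process table only if this ps supports the format.
if ps -eo nlwp=,comm= >/dev/null 2>&1; then
    ps -eo nlwp=,comm= | threads_per_command | sed -n '1,5p'
fi
```

Running this from cron every minute and appending the output to a file would show whether apache2 or php-fpm thread counts climb before each outage. On systemd hosts, systemctl show apache2.service -p TasksCurrent -p TasksMax reports the per-service task limit to compare against.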

nkomarov · Jun 29, 2021

The spikes are still there; the one today at 3:06 AM might be the daily backup. On the other hand, there were spikes in "Apache & php-fpm usage" at 2:51 and 4:01 AM.

There were some restarts as well:

journalctl -u apache2.service --since today --no-pager
-- Logs begin at Wed 2021-05-19 17:14:36 CEST, end at Tue 2021-06-29 11:18:54 CEST. --
Jun 29 03:26:06 XXX.stratoserver.net systemd[1]: Reloading The Apache HTTP Server.
Jun 29 03:26:07 XXX.stratoserver.net systemd[1]: Reloaded The Apache HTTP Server.
Jun 29 03:26:12 XXX.stratoserver.net systemd[1]: Reloading The Apache HTTP Server.
Jun 29 03:26:13 XXX.stratoserver.net systemd[1]: Reloaded The Apache HTTP Server.

Well, at least no more "Unable to fork new process" messages.

And in the Apache logs there are only ModSecurity errors like "Access denied" because hackers were trying some XSS-prone URLs.

And wp-cron.php is no longer called; the last edit helped.

I also restarted mysqld, and memory consumption fell by 200M.

Ehud · Jun 29, 2021

Server spikes around midnight or 3/4 AM happen due to cron runs.

A hosting provider may take an instance offline when resource usage passes a certain limit, such as 60% of "burstable" server capacity on AWS.

weltonw · Jun 29, 2021

nkomarov said:
The spikes are still there; the one today at 3:06 AM might be the daily backup. On the other hand, there were spikes in "Apache & php-fpm usage" at 2:51 and 4:01 AM.

There were some restarts as well:

Well, at least no more "Unable to fork new process" messages.

And in the Apache logs there are only ModSecurity errors like "Access denied" because hackers were trying some XSS-prone URLs.

And wp-cron.php is no longer called; the last edit helped.

I also restarted mysqld, and memory consumption fell by 200M.

Reloads are normal. As long as there isn't any downtime, Apache reloads are to be expected. The spikes are interesting: is there anything in the logs around that time? I would also look into exactly what is causing those spikes, PHP or Apache.
