
Issue: Something changed badly after reboot

ic3_2k (New Pleskian)

Server operating system version: AlmaLinux 8.9
Plesk version and microupdate number: 18.0.61 #5
Hi everybody!

This weekend the server stopped working. I connected to the KVM and a beautiful 'kernel panic' was waiting for me. Luckily, after a forced reboot the only problem I found (at first) was that one of the disks had been dropped from the RAID. I added it back after checking that there were no errors on the disk, and everything looked OK.
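(For reference, re-adding a dropped member typically looks something like the sketch below; the array and partition names here are assumptions, taken from the layout shown later in this thread:)
Code:
# cat /proc/mdstat                                 # see which array is degraded
# smartctl -a /dev/nvme1n1 | grep -i error         # confirm the disk itself reports no errors
# mdadm --manage /dev/md3 --re-add /dev/nvme1n1p3  # put the partition back into the mirror
# watch cat /proc/mdstat                           # follow the resync progress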

But after only 2 days the webpages started getting slower and slower, so I checked the monitoring and saw that something changed in how the system manages memory after the reboot. As you can see, before the reboot memory consumption never went above 18.5 GB and swap usage was always over 4.5 GB; after the reboot memory consumption went crazy and the swap is barely used compared to before:

[Screenshots: memory and swap usage graphs before and after the reboot]
Subscription CPU and memory usage is the same as before:

[Screenshot: subscription CPU and memory graphs]
Also Plesk's own consumption (the reboot is marked in red):

[Screenshots: Plesk service memory and CPU consumption graphs]

MySQL is the same as always:
[Screenshot: MySQL memory consumption graph]

Can someone help me figure out what is happening?

Thank you all!!!
 
It could be the RAID configuration, i.e. that it's still rebuilding the RAID array.

Did you check the log files like /var/log/messages, or the processes with the htop or glances commands?
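If not, a minimal sketch of what to look at (standard tools; glances only if it is installed):
Code:
# tail -n 200 /var/log/messages        # recent kernel and system messages
# journalctl -p err -b                 # errors logged since the current boot
# htop                                 # interactive view of processes and memory
# glances                              # overall system overview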
 
Hello, and thanks for the answer.

The RAID is in good health and the reconstruction only took about 36 minutes:

Bash:
[root]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 nvme0n1p2[2] nvme1n1p2[1]
      1046528 blocks super 1.2 [2/2] [UU]

md3 : active raid1 nvme0n1p3[0] nvme1n1p3[1]
      493131776 blocks super 1.2 [2/2] [UU]
      bitmap: 4/4 pages [16KB], 65536KB chunk

unused devices: <none>
[root]# mdadm --query --detail /dev/md{2,3}
/dev/md2:
           Version : 1.2
     Creation Time : Tue Apr 30 11:45:43 2024
        Raid Level : raid1
        Array Size : 1046528 (1022.00 MiB 1071.64 MB)
     Used Dev Size : 1046528 (1022.00 MiB 1071.64 MB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Thu Jun 20 15:03:57 2024
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              Name : md2
              UUID : 9590b356:2a61ab94:d17f87f2:68ff96d4
            Events : 7395

    Number   Major   Minor   RaidDevice State
       2     259        7        0      active sync   /dev/nvme0n1p2
       1     259        3        1      active sync   /dev/nvme1n1p2
/dev/md3:
           Version : 1.2
     Creation Time : Tue Apr 30 11:45:44 2024
        Raid Level : raid1
        Array Size : 493131776 (470.29 GiB 504.97 GB)
     Used Dev Size : 493131776 (470.29 GiB 504.97 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Thu Jun 20 15:05:34 2024
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : md3
              UUID : b187ede9:1b15301f:91acec36:93c42492
            Events : 513607

    Number   Major   Minor   RaidDevice State
       0     259        8        0      active sync   /dev/nvme0n1p3
       1     259        4        1      active sync   /dev/nvme1n1p3
[root]#

Also, I don't see anything weird in htop beyond the consumption bumps you can see on the monitoring screenshots: now the cached memory grows until it eats all the available RAM, and once it reaches that point, the cache and slab drop down again, repeating this cycle over and over.
I tried searching dmesg, journalctl and /var/log/messages for messages in the time range when this happens and didn't see anything interesting...

Before this memory issue the server had been running stably for almost a year.

[Screenshot: memory usage graph showing the repeating cache growth cycle]
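(For reference, this cache/slab cycle can be watched with standard tools like the ones below; a sketch, nothing specific to this setup:)
Code:
# watch -n 60 free -m                              # used vs. cache vs. available over time
# grep -E 'Slab|SReclaimable|SUnreclaim' /proc/meminfo
# slabtop -o | head -20                            # which kernel caches are growing
# ps aux --sort=-%mem | head -15                   # top userland memory consumers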

MySQL, which is usually the source of memory problems, has the same consumption as before the reboot:
[Screenshot: MySQL memory consumption graph]
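(For completeness, MySQL's configured memory can be double-checked like this; a sketch, and the process may be named mariadbd instead of mysqld depending on the version:)
Code:
# plesk db "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"   # main MySQL/MariaDB memory setting
# plesk db "SHOW GLOBAL STATUS LIKE 'Threads_connected';"     # current connection count
# ps -o rss,cmd -C mysqld                                     # resident memory of the database process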

Any other suggestions? I'd be happy to check anything you can suggest ;)
 
The sawtooth-like pattern in your server's real memory consumption can occur due to various reasons such as periodic tasks, garbage collection processes, or caching mechanisms. It's generally not a cause for concern unless you're experiencing performance issues or unexpected crashes.

Regarding your worries about the swap file, it’s important to understand how Linux manages memory. When it seems like your system is using a lot of swap, it might be working as intended to optimize performance. I recommend reading this helpful page for more detailed insights: Linux Ate My RAM. This resource explains why high memory usage is often not a problem and how Linux handles memory management efficiently.
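A quick way to see that distinction on the server itself is to compare "available" with "free" (a sketch using standard tools):
Code:
# free -h                                          # the "available" column is what applications can still get
# grep -E 'MemAvailable|Cached|SwapCached' /proc/meminfo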
 
I have no worries about swap....

My worries are about customers' webpages performing poorly for several hours per day, every single day, since the kernel panic happened.

I only know that when web pages are behaving slowly, the RAM consumption is at one of the peaks that can be seen in the screenshot. I did mention the swap because it is the only other graph with visible changes. CPU is the same, MySQL is the same, etc.

Before this happened, RAM consumption was always more or less the same, with the total staying below 18 GB at all times, as was the swap, which maintained a stable usage of about 2.5 to 3 GB.

I know a little bit about Linux and RAM; let's say I'm at user level ;)
If you showed me a graph like this one I would say there's no problem with memory: the used memory is still around 9.31 GB, so there are over 21 GB available (cache + free). But the fact is the webpages are now behaving slowly, to the point of responding with '50X' errors every day, and customers are angry.
[Screenshot: memory graph with about 9.31 GB used and the rest in cache/free]
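One thing that could narrow it down is logging the top memory consumers every few minutes and checking what sits on top when a slowdown hits. A minimal sketch (the log path and interval are arbitrary):
Code:
# while true; do date; ps aux --sort=-%mem | head -15; echo; sleep 300; done >> /root/mem-snapshots.log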

What I don't understand is why it behaved better before, when the cache was in the swap partition, than now, with the cache stored entirely in RAM.
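For comparison with the old behaviour, the kernel tunables that control how aggressively Linux swaps versus keeps cache can be checked like this (a sketch; these are distro defaults unless someone changed them):
Code:
# sysctl vm.swappiness vm.vfs_cache_pressure   # swap vs. cache reclaim preferences
# swapon --show                                # confirm swap actually came back up after the reboot
# sysctl -a | grep dirty                       # writeback tunables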
 
Without having access to your server it's very hard to tell what's going on.
You need to look at the logfiles of the sites that perform badly, i.e. the /var/www/vhosts/system/domain.com/logs/error_log and proxy_access_ssl_log.

If you are familiar with the command line, go to those locations and use the "tail" command to see what's going on:
Code:
# cd /var/www/vhosts/system/domain.com/logs
# tail -f error_log
# tail -f proxy_access_ssl_log

Hit Ctrl+C to stop the logging.

Reload the site a few times and in the meantime have a look at the log files.
 
Is the SMART log of your NVMe SSDs okay?
What does iotop / htop say about iowait?
Did you ever find out the reason for the crash?
Are all fans running properly? Anything blocking the airflow?
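For example (a sketch, assuming nvme-cli, smartmontools and sysstat are installed):
Code:
# nvme smart-log /dev/nvme0 | grep -iE 'critical|temperature|media|error'
# smartctl -a /dev/nvme0n1
# iostat -x 5 3                                # %iowait and per-device utilisation
# iotop -o                                     # only processes currently doing I/O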
 
Did you verify that fail2ban has started? Sometimes it cannot start after a reboot because log files that were previously there have meanwhile been deleted. A server without fail2ban will get a lot of useless traffic. This could cause higher CPU and memory load.
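A quick way to check (standard systemd and fail2ban commands):
Code:
# systemctl status fail2ban
# fail2ban-client status                       # lists the active jails
# journalctl -u fail2ban -b                    # startup errors, e.g. missing log files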
 
Thanks for the answers!!!

@mow
The SMART log shows no errors.
Code:
Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
Error Information (NVMe Log 0x01, max 64 entries)
Also no high values in iotop.

I never found the source of the kernel panic error... I had to reboot the server to put it back to work ASAP, losing along the way all the dmesg and kernel logs, as they were overwritten by the new boot process messages.
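To avoid losing the kernel messages if it panics again, the journal can be made persistent and kdump can be enabled. A sketch for AlmaLinux 8 (kdump also needs a crashkernel= parameter on the kernel command line):
Code:
# mkdir -p /var/log/journal && systemctl restart systemd-journald   # keep journals across reboots
# dnf install -y kexec-tools && systemctl enable --now kdump        # write crash dumps to /var/crash
# journalctl --list-boots                                           # verify previous boots are retained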

It is a bare-metal server in a datacenter, so it is a few thousand kilometers further away than I feel comfortable travelling just to check the fans ;-)
But what I can say is that the temperature sensors show values between 42ºC and 51ºC, so I think ventilation must be OK.

@Bitpalast
I have Imunify360 installed, so fail2ban is disabled as there are incompatibilities between them.

I don't know, I hate it when this kind of thing happens. Since yesterday it looks like the system has started to behave normally again, so now I only know that something unknown caused a kernel panic and that afterwards the server behaved weirdly for about 12 days...
So this can happen again at any moment.
[Screenshot: memory graph returning to the previous stable pattern]
 