Question Plesk Obsidian on OVH dedicated server: Smart-ATA-Error

D4NY

Regular Pleskian
Hello everyone,
this question is not directly related to the Plesk software itself, but it is very important for anyone else running Plesk on a software RAID at OVH.
Maybe someone can recommend the best way to deal with this problem.
Last weekend the server suddenly went offline and I got this message:

CPU cooling improvement Date 2022-03-26 01:49:33 CET (UTC +01:00), CPU cooling improvement: Server found freezing. After checking the cooling system is not working anymore. Replacement of the cooling module. Workshop test ok. Hardware check ko: SMART_ATA_ERROR : SMART ATA errors found on disk PN21xxxxxxxxxx (Total : 2). The disk S/N: PN21xxxxxxxxxx needs to be replaced. Please contact OVH support to schedule a replacement. Server login on disk. Ping ok.
About an hour after the problem, OVH support replaced the cooling system and restarted the server. It is now running, but I still have the disk problem.
So my question to every Plesk server holder is: has anyone already encountered this kind of problem, and can changing the disk turn into a bigger problem than changing the whole server, or than running in this situation for a few months? My server has 3 x 2 TB disks with OVH software RAID.
I don't know how the OVH software RAID works and whether the replacement is absolutely necessary. My nightmare is authorizing the disk replacement and ending up with the server offline and without data.
OVH simply says that every backup, restore and software RAID management task has to be done by the customer, so there is no help apart from the disk replacement itself.

What does this kind of error indicate?

Running the cat /proc/mdstat command I got:
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md3 : active raid1 sda3[0] sdb3[1] sdc3[2]
1932506048 blocks [3/3] [UUU]
bitmap: 4/15 pages [16KB], 65536KB chunk

md2 : active raid1 sda2[0] sdb2[1] sdc2[2]
20478912 blocks [3/3] [UUU]
So it seems that all 3 physical disks are running.

Any ideas that can help me keep my Plesk healthy, safe and running?
 
My nightmare is authorizing the disk replacement and ending up with the server offline and without data.
Welcome to the club.

OVH simply says that every backup, restore and software RAID management task has to be done by the customer, so there is no help apart from the disk replacement itself.
All data centers handle it that way, because otherwise they would be liable for damage to the data. They cannot possibly assume that liability, because they don't know how your disks are configured, etc.

What does this kind of error indicate?
Install "smartmontools", then use # smartctl for your disks to determine what's wrong. However, if a software explicitely asks for a disk replacement, it is probably right about the assessment.

So it seems that all 3 physical disks are running.
I only see two. Where is the third, and why do you have hard disk mirroring (RAID 1) when you have three disks? What about the third disk?

Some general advice for your disk change (an mdadm sketch of points 3 and 5 follows below this list):
1) Back up your partition table, e.g. for a device "sda" it would be
# /usr/sbin/sgdisk --backup=/home/backup_partition_table /dev/sda
2) Back up your master boot record
# dd if=/dev/sda of=/home/backup_mbr bs=512 count=1
3) Back up your RAID configuration. If you have software RAID, read in the manual how to back up the configuration with your specific RAID software.
4) Back up GRUB, e.g.
# cp -R /etc/grub.d /home
Download all the backups so that you can restore data in the rescue mode of your server should it not reboot after the disk change.
5) Normally you'd tell the RAID that you plan to remove one of the disks. If there is an option to set a disk "offline" or "to be removed", do that in your RAID software.
After the data center has changed the disk, you need to tell your RAID to integrate the new disk into the existing RAID and rebuild the RAID from the source.
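If the OVH software RAID turns out to be the usual Linux mdadm (your /proc/mdstat output looks like it), a rough sketch of points 3 and 5 could look like the following. The device names are taken from your output, but double-check them against your own layout:
# mdadm --detail --scan > /home/backup_mdadm.conf (back up the array configuration)
# mdadm --detail /dev/md2
# mdadm --detail /dev/md3 (note which partitions belong to which array)
# mdadm /dev/md2 --fail /dev/sda2 --remove /dev/sda2
# mdadm /dev/md3 --fail /dev/sda3 --remove /dev/sda3 (mark the failing disk's members as failed and remove them before the swap)
After the disk has been changed, restore the partition table to the new disk and re-add its partitions; the rebuild then starts automatically and can be watched with cat /proc/mdstat:
# /usr/sbin/sgdisk --load-backup=/home/backup_partition_table /dev/sda
# mdadm /dev/md2 --add /dev/sda2
# mdadm /dev/md3 --add /dev/sda3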
 
Thank you Peter for your fast reply as usual.

Running fdisk -l I got:
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sda: 2000.4 GB, 2000398934016 bytes, 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt
Disk identifier: 7443B301-1F02-484E-AB6E-C60F0B086BD0


# Start End Size Type Name
1 40 2048 1004.5K BIOS boot bios_grub-sda
2 4096 40962047 19.5G Linux RAID primary
3 40962048 3905974271 1.8T Linux RAID primary
4 3905974272 3907020799 511M Linux swap primary
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sdb: 2000.4 GB, 2000398934016 bytes, 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt
Disk identifier: 64C77718-2B2A-4353-84E7-FBFF7BDDEC17


# Start End Size Type Name
1 40 2048 1004.5K BIOS boot bios_grub-sdb
2 4096 40962047 19.5G Linux RAID primary
3 40962048 3905974271 1.8T Linux RAID primary
4 3905974272 3907020799 511M Linux swap primary
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sdc: 2000.4 GB, 2000398934016 bytes, 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt
Disk identifier: 1195F706-9018-4576-801E-CB63AB208D00


# Start End Size Type Name
1 40 2048 1004.5K BIOS boot bios_grub-sdc
2 4096 40962047 19.5G Linux RAID primary
3 40962048 3905974271 1.8T Linux RAID primary
4 3905974272 3907020799 511M Linux swap primary

Disk /dev/md2: 21.0 GB, 20970405888 bytes, 40957824 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/md3: 1978.9 GB, 1978886193152 bytes, 3865012096 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/mapper/docker-9:3-95422324-pool: 107.4 GB, 107374182400 bytes, 209715200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 65536 bytes / 65536 bytes

I think we have 3 disks in RAID 1 mode; take a look at this note on my OVH dashboard:
Server SYS-BF-1 - Intel Xeon E3-1245v2 (France) / E3-1231v3 (Canada) - 32GB DDR3 1333MHz - 3x 2To HDD SATA Soft RAID

Your instructions are for the case where I decide to proceed with replacing the disk (backup, restore...). At the moment I am still at the earlier stage of understanding all the possibilities. In particular, I found a link (in Italian: HERE) that says SMART ATA errors can occur due to high temperatures. Since my server's cooling system was broken, could this be a consequence of the high temperature and not of a disk failure? Should I repeat the test, in your opinion?

How can I check if all disks are running and RAID 1 is mirroring data on all 3 disks?
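(I guess something like the following would show it, assuming the OVH soft RAID is plain Linux mdadm; please correct me if I am wrong:)
# cat /proc/mdstat ([3/3] [UUU] should mean all three members are up)
# mdadm --detail /dev/md2
# mdadm --detail /dev/md3 (each should list the sda, sdb and sdc partitions as "active sync")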
 
UPDATE

Running smartctl -l error /dev/sdc I got:
=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged

Running smartctl -l error /dev/sdb I got:
=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged

Running smartctl -l error /dev/sda I got:
=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 2
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 40623 hours (1692 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 77 89 8a fa 02 Error: UNC at LBA = 0x02fa8a89 = 49973897

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 80 a0 80 99 fa 40 00 03:51:26.270 READ FPDMA QUEUED
60 80 98 00 99 fa 40 00 03:51:26.270 READ FPDMA QUEUED
60 80 90 80 98 fa 40 00 03:51:26.269 READ FPDMA QUEUED
60 80 88 00 98 fa 40 00 03:51:26.269 READ FPDMA QUEUED
60 80 80 80 97 fa 40 00 03:51:26.268 READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 39280 hours (1636 days + 16 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 0a 76 fb 70 0c Error: UNC at LBA = 0x0c70fb76 = 208730998

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 d8 80 14 71 40 00 07:42:37.912 READ FPDMA QUEUED
60 80 d0 00 0d 71 40 00 07:42:37.912 READ FPDMA QUEUED
60 80 c8 80 07 71 40 00 07:42:37.911 READ FPDMA QUEUED
60 80 c0 00 07 71 40 00 07:42:37.911 READ FPDMA QUEUED
60 80 b8 80 06 71 40 00 07:42:37.911 READ FPDMA QUEUED

So the problem seems to be on sda. But to be honest, I can't understand this log or how it relates to the OVH warning issued after the cooling system was replaced.
 
Disk and RAID setup:
3 disks and RAID 1: I haven't seen that yet, but it seems to be possible. In that case you have one master and two mirrors of the same master. Normally with 3 disks you'd want RAID 5, but o.k., for some reason you might have two mirrors of one disk, so three in total. If that is the case, it is very unlikely that anything will go wrong with a disk swap. Things could get nasty when rebuilding the array after the swap. I cannot assist with that; it will depend on your RAID software configuration and requirements. On most hardware controllers a rebuild starts automatically when a new disk in the array is detected, but I am not sure about your RAID. You ought to make sure, however, that you have really created two mirrors.

smartctl output:
UNC = "uncorrectable error", meaning that too many bits are flipped, so the built-in error correction cannot recover the data as it ought to look.
It is not that bad, considering your drive has been up a long time. Such errors may occur because many drives have physical defects. The big question is: how does your RAID software handle them?

I can only say that hardware RAID controllers also detect such errors on patrol reads, correct them (not the error itself, but they restore data integrity from the mirror) and then write the locations of the defective areas to a table in their memory so that these won't be used again when writing new data to the disk. After a while the table fills up with more and more defective areas, but it is absolutely possible to have hundreds of such errors in there. It simply means that the disk no longer has its full capacity, but the reduction is negligible.

But: I do not know whether your RAID setup can do the same. If it stores a table of defective sectors so that new write operations to these areas are avoided, and if it is configured to do patrol reads that fix RAID integrity errors, I'd say that with only two errors on one out of three disks you do not need to change the disk at this time. But if your RAID is not capable of memorizing the errors or restoring the RAID integrity from the error-free disks, or if the number of errors increases within a short period of time, it is most likely better to swap the faulty disk.
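If your soft RAID is the standard Linux mdadm, the closest equivalent to a patrol read is a periodic "check" scrub. A sketch, using the md3 array from your output:
# echo check > /sys/block/md3/md/sync_action (reads all members; progress is shown in /proc/mdstat)
# cat /sys/block/md3/md/mismatch_cnt (number of inconsistencies found by the last check; "repair" instead of "check" also rewrites them)
But again, I do not know how OVH has set this up, so take it as a pointer, not as a definitive procedure.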
 
Thanks for your explanation, very helpful.
Yes, 3 disks in RAID 1 is a possible configuration. The maximum number of disk failures is N-1, so with 3 disks the server should keep running with 2 disks damaged. But obviously I don't want to test whether that's true.

Honestly, I don't know if the OVH RAID software is capable of restoring RAID integrity. As I said, having Plesk unavailable is a nightmare, so we can't do this kind of task blind. Somewhere I've read that read errors are not as serious as write errors, and I got only 2. Is it possible to understand when they were logged? The log shows the total working hours of the disk, but nothing about WHEN the problem was found.

Running smartctl -t short /dev/sda, no errors are found (fast test).
Now I'm running smartctl -t long /dev/sda, which should last 5 or 6 hours. Let's see if something new happens (full test).
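(If I read the smartctl manual correctly, the result should then show up in the self-test log:)
# smartctl -l selftest /dev/sda (shows whether the short/long tests completed without errors)
# smartctl -c /dev/sda (estimated duration of the tests)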
 
Error 2 occurred at disk power-on lifetime: 40623 hours (1692 days + 15 hours)
...
Error 1 occurred at disk power-on lifetime: 39280 hours (1636 days + 16 hours)
I think that "# smartctl -A /dev/<device name>" should give you the power-on lifetime of the disk, so that you can compare it to the hours at which the errors occurred.
 
You the Man! Very interesting....

Running smartctl -A /dev/sda I got:
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 136 136 054 Pre-fail Offline - 80
3 Spin_Up_Time 0x0007 128 128 024 Pre-fail Always - 485 (Average 489)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 53
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 145 145 020 Pre-fail Offline - 24
9 Power_On_Hours 0x0012 094 094 000 Old_age Always - 45369
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 53
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 1084
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 1084
194 Temperature_Celsius 0x0002 181 181 000 Old_age Always - 33 (Min/Max 21/63)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
This means that the sda hard disk has 45369 hours of running life.
But it also means that the errors in the log are much older, because they happened at 40623 and 39280 hours.
Doing a simple calculation, (45369 - 40623) / 24 shows that the more recent read error is about 198 days old, and (45369 - 39280) / 24 that the older one is about 254 days old.
Now at least we know that the cooling system failure is not related to the SMART ATA alert.
 
Assuming you can choose the disk configuration for a new Plesk dedicated server, which of these options would you prefer and why?
1. SoftRAID 4x2TB SATA
2. SoftRAID 2x480GB SSD
3. SoftRAID 2x450GB NVMe
 
The drawback in all of these options is the software RAID. Its problem is that all the RAID operations need to be performed by the CPU. If the CPU is loaded up with many other tasks, it will slow the RAID down considerably, too. Or - if the niceness is set accordingly - the RAID software transactions will slow down the rest of the system. Either way. The slowest possible choice is the combination of conventional HDDs with soft RAID. The best possible choice in terms of speed would be hardware RAID in combination with NVMe. Hardware RAID in combination with conventional disks is also very fast, as it does write-back caching. Speed? Size? Reliability? The choice is yours, according to your priority.
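If you do stay on Linux software RAID, one knob that at least limits how much a resync or rebuild competes with your normal workload is the md speed limit (the value below is only an example):
# sysctl dev.raid.speed_limit_min
# sysctl dev.raid.speed_limit_max (current lower/upper rebuild speed bounds in KB/s per device)
# sysctl -w dev.raid.speed_limit_max=50000 (example: cap the rebuild speed so regular I/O keeps more headroom)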
 
You are right, I didn't specify that speed is not a problem. We are running fewer than 100 websites, most of them used as static presentation sites and for mail. There are some e-commerce sites, but none with high traffic. Having 32 GB of RAM and an 8-thread Xeon CPU is more than enough. So my need is not size, not speed, but reliability, which for me means uptime and disk longevity. In the past I've read that HDDs are slower but, percentage-wise, much less prone to failure than SSDs. Is that still the case? I would really like to try NVMe... but breakage is my nightmare.
 