Issue Acronis backup memory leak/kernel panic issue causing server crash

thebt

New Pleskian
Server operating system version
Ubuntu 18.04.6 LTS
Plesk version and microupdate number
Obsidian Version 18.0.46 Update #1
We have an issue with Acronis backup: either a memory leak chews through all available RAM and then crashes the box, or a kernel panic occurs that chews through all available RAM and crashes the box.

We have an open ticket with Plesk, who've bumped it to Acronis, who have been ridiculously bad at helping here and have provided zero tangible support, which is ridiculous for a backup product.

Posting here as we're really out of options for where to take this next.

The issue happens intermittently, every 1-2 weeks: during a backup something goes wrong, RAM usage spikes rapidly, and the box locks up and crashes within 90-120 seconds. We have 100+ sites on this server, so this is not ideal at all. All we can do when it crashes is hard-reboot the box.
Looking at our atop logs, the box chews through ~20 GB+ of RAM in 90 seconds or so.

It looks like the snap_api module is having some sort of issue. The box has 64 GB of RAM with approx. 30 GB free. We're running backups every 2 hours. We can go days with backups running normally, and then all of a sudden the issue hits at random.
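
As a rough sanity check on those numbers, here is a minimal Python sketch. It assumes the fill rate stays constant at the ~20 GB per 90 seconds seen in atop and that about 30 GB is free when the spike starts. The function names are made up for illustration.

```python
def fill_rate_gb_per_s(consumed_gb, seconds):
    """Average rate at which RAM is being consumed, in GB per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return consumed_gb / seconds


def seconds_until_exhausted(free_gb, rate_gb_per_s):
    """Time until free RAM hits zero, assuming a constant fill rate."""
    if rate_gb_per_s <= 0:
        raise ValueError("rate must be positive")
    return free_gb / rate_gb_per_s


# Figures from the atop logs: ~20 GB consumed in ~90 s, ~30 GB free beforehand.
rate = fill_rate_gb_per_s(20, 90)
print(f"fill rate: {rate:.2f} GB/s")
print(f"time to exhaustion: {seconds_until_exhausted(30, rate):.0f} s")
```

With these inputs the estimate is roughly 135 seconds, the same order as the 90-120 second lock-ups. With atop logging every 10 seconds, that leaves only a handful of samples covering the whole event.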

The issue is happening at the kernel level, not at the user level. It's not a load issue or a memory issue, and the box has several hundred GB free. The backup is ~400 GB or so.
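
For anyone wanting to check the kernel-level part on their own box, here is a minimal Python sketch (not something Acronis or Plesk provide) that samples `/proc/meminfo` once a second and prints `MemAvailable`, `Slab` and `SUnreclaim`. The assumption is that kernel-side allocations would show up as growth in the slab counters rather than in any process's memory; if the module allocates pages some other way, these fields may not move. The function names and the 1-second interval are illustrative.

```python
import time

# /proc/meminfo fields to track; values are reported by the kernel in kB.
FIELDS = ("MemAvailable", "Slab", "SUnreclaim")


def parse_meminfo(text):
    """Return {field_name: integer value} for each 'Name:  value [kB]' line."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def sample(path="/proc/meminfo"):
    """Read the file once and keep only the fields of interest."""
    with open(path) as fh:
        info = parse_meminfo(fh.read())
    return {field: info.get(field) for field in FIELDS}


def watch(interval=1.0, count=None, path="/proc/meminfo"):
    """Print one timestamped sample per interval; count=None runs until killed."""
    taken = 0
    while count is None or taken < count:
        values = sample(path)
        fields = " ".join(f"{k}={v}kB" for k, v in values.items())
        # flush so the last lines reach the log even if the box locks up
        print(time.strftime("%H:%M:%S"), fields, flush=True)
        taken += 1
        time.sleep(interval)
```

Calling `watch()` with output redirected to a file gives a finer-grained trace than 10-second atop logging. If `SUnreclaim` climbs while per-process memory stays flat, that would be consistent with a kernel-side leak.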

The box is running WordPress sites only, on Ubuntu 18 with the latest Plesk and Acronis versions.

We've tried:
-changing swappiness down to 20 -> initially we'd thought the swappiness was too high at 50 and was causing disk thrashing
--changing swappiness to 5 seems to make the problem worse

-adding a backup precommand to flush cached memory before the backup -> initially we'd thought the box wasn't letting go of cached memory and so wasn't leaving the Acronis agent enough RAM. This has reduced the frequency of the issue from every 1-5 days to every 1-2 weeks

-removing Acronis backup exclusions -> this seems to be their solution to half their problems, which is ridiculous in itself. How having exclusions in a backup is a problem is beyond me. This made no difference

-doing a full kernel update -> no change

-logging everything -> there's nothing in the logs aside from the snap_api module showing page allocation errors, which is because the box is running out of RAM

-running atop with 10-second logging -> makes it clear the issue has nothing to do with what's running on the box; load is nominal and doesn't change significantly before all RAM gets used

-uninstalling/reinstalling the Acronis agent

-fiddling with cron schedules to even the load out as much as possible so big operations aren't overlapping
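
For reference, the cache-flush precommand mentioned in the list can be sketched roughly as below. This is an illustration in Python rather than the exact command in use, and it assumes the standard Linux `/proc/sys/vm/drop_caches` interface, where writing `3` drops the page cache plus dentries and inodes. It needs root, and the function name is made up.

```python
import os

# Standard Linux knob: 1 = page cache, 2 = dentries and inodes, 3 = both.
DROP_CACHES = "/proc/sys/vm/drop_caches"


def flush_caches(mode="3", path=DROP_CACHES):
    """Write dirty pages to disk, then ask the kernel to drop clean caches."""
    if mode not in {"1", "2", "3"}:
        raise ValueError("mode must be '1', '2' or '3'")
    os.sync()  # dirty pages cannot be dropped, so flush them first
    with open(path, "w") as fh:
        fh.write(mode + "\n")
```

Calling `flush_caches()` as root before the backup starts only releases clean, reclaimable cache. It would not help with memory a kernel module has allocated and not freed, which may be why it only made the crashes less frequent.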


Anyone have any ideas?
 

Attachments

  • acronis.jpg (200 KB)
Hello, Acronis rep here.
I must admit that hearing such feedback about our support team is extremely rare.
Would you mind sharing the Acronis ticket number so I can review your experience and bring it up with our support directors?
Based on my previous experience, I believe we'll still need you to work with the support team to collect the debug data the dev team requires to continue.
 