Peter, could you please share a little more information? Is it an in-house solution? How does it work, and does it give better results than other solutions like `monit`?
It's an in-house solution: a rather extensive PHP script, run on the command line, that checks anything and everything, e.g. whether the mail service is running, whether the Dovecot SNI files are in place, even whether there are suspicious cron jobs. It also checks the load of the Apache processes, fixes typical issues such as symbolic links left pointing to non-existent Apache conf files after a user removed a domain, and can restart PHP-FPM, independent of version, if the CPU load gets too high or there are too many Apache processes.

There is lots of other stuff, e.g. edac-util output, CPU temperature (sometimes a fan fails, which service monitoring utilities won't detect), database status, number of database processes per user, CPU load per user, etc. It also checks that users stick to the configurations their contract allows (some users try to raise e.g. PHP-FPM RAM usage; a similar thread in "Reports" recently had this as a topic). It also opens test websites for each PHP-FPM version to check whether they really respond, and it opens the Roundcube webmail sites and tests whether the login page is actually displayed (as there are cases where Roundcube becomes inaccessible).

And: it can fix many issues by itself. It basically does everything required to keep a server running smoothly. It pre-emptively runs straces on processes when a problem may be coming up, so that once the problem exists we already have the system status from "before" it occurred.

The script:
- checks every 10 seconds
- checks way more than simple service uptime monitoring does, including file contents, potentially malicious cron entries etc.
- can fix common problems (auto-doc function), can block user accounts in case of suspicious activities
- sends notifications as required
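To give an idea of what one of the self-fixing checks can look like, here is a minimal sketch of the "symbolic link pointing to a removed Apache conf file" case. It is not the actual script (which is PHP); it is written in Python for illustration, and the function names and the directory passed in are assumptions, not something taken from a real server layout.

```python
from pathlib import Path


def find_dangling_symlinks(conf_dir):
    """Return symlinks in conf_dir whose target no longer exists."""
    dangling = []
    for entry in sorted(Path(conf_dir).iterdir()):
        # is_symlink() looks at the link itself; exists() follows it,
        # so a link with a missing target is a symlink that does not exist.
        if entry.is_symlink() and not entry.exists():
            dangling.append(entry)
    return dangling


def remove_dangling_symlinks(conf_dir, dry_run=True):
    """Report dangling symlinks and, unless dry_run is set, delete them."""
    found = find_dangling_symlinks(conf_dir)
    if not dry_run:
        for link in found:
            link.unlink()  # removes only the link, never a target
    return found
```

Run with `dry_run=True` first, the function only reports what it would remove. A check like this would be one of many called on each 10-second pass, with the notification sent whenever the returned list is not empty.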
Simple service monitoring just isn't enough. You must also check that all important system configuration files are correct and in place, that the server response time is in an acceptable range, that the hardware is healthy, and that users don't circumvent restrictions.
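As a rough illustration of the difference between "the service is up" and "the page really responds", the following sketch fetches a page, looks for a piece of text that should be on it (for example something from the webmail login form) and also checks the response time. Again this is Python for illustration only, not the PHP script itself; the function names, the marker string and the 5-second limit are assumptions.

```python
import http.client
import time
import urllib.request


def page_looks_ok(body, marker, elapsed, max_seconds):
    """Pure check: expected text is present and the response was fast enough."""
    return marker in body and elapsed <= max_seconds


def check_page(url, marker, max_seconds=5.0):
    """Fetch url and report whether it shows the expected content in time."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=max_seconds) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
    return page_looks_ok(body, marker, time.monotonic() - start, max_seconds)
```

A call would look like `check_page("https://webmail.example.com/", "some text from the login form")`, where both the hostname and the marker are placeholders. A web server can answer with HTTP 200 while the application behind it shows an error page, which is why the check looks at the content and not just at the status.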