
Disk utilization increased after upgrade from 12.0

Zaxos Dogganos

New Pleskian
On two Debian GNU/Linux 7.9 (wheezy) servers, disk utilization increased after the upgrade from 12.0 to 12.5. I can't understand what is causing it. The main suspect is mysql, but searching with iostat, iotop and atop I could not identify anything unusual.

I am quite sure that the Plesk upgrade caused it, because the two identical servers were upgraded on very different dates: the first server was upgraded on 9th November, the second on 30th December. You can see the difference in the munin graphs. Is anybody else experiencing the same thing?
 

Attachments

  • serv1 - year.png (22.5 KB)
  • serv2 - month.png (31.8 KB)
News after much searching: the culprit is fail2ban, which by the way cannot be stopped properly. Something must have changed after the upgrade. Will continue searching...
 
@Zaxos Dogganos

Fail2Ban itself does not cause any heavy disk load, and Fail2Ban is certainly not the issue you should worry about.

Let me explain a bit.

Fail2Ban monitors logs, and by simply shutting down jails one only reduces the number of reads (not writes).

Your graphs do not make any distinction between reads and writes, so it is difficult to come to any conclusion from those graphs.

However, judging from your posts, one can safely assume a few "normal" explanations for the underlying issue, one of which is a real problem.

The fact that disk activity decreases when shutting down jails indicates that your logs are fairly huge.

That is fairly remarkable for a new server, for which the logs should be relatively small.

One possible explanation is a log rotation problem, which is often caused by coinciding cron jobs and/or simultaneous tasks executed in Plesk.

Due to your remarks about the plesk-wordpress jail and the plesk-apache jail, it can be the case that WordPress cronjobs are causing the high disk activity (and resource overusage).

However, that would not explain the fact that shutting down plesk-apache-badbot also helps to decrease disk activity.

The other feasible explanation is the possibility that your server has been under attack (of some form) for considerable time.

This would fill up the logs considerably (making log rotation issues more likely AND forcing Fail2Ban to "work harder").

Sure, the common problem of WordPress cronjobs jamming up the server and causing resource overusage would still add to that problem of "huge logs", but that is mere coincidence.

In conclusion, I am pretty sure that you should not deactivate Fail2Ban or its jails, since there is a high probability that you are under continuous attack.

In short, analyse the probability of attacks AND try to block (a lot of) IPs with the Plesk Firewall AND reactivate all of the Fail2Ban jails.

Hope the above helps.

Regards!
 
First of all, I thank you for your elaborate answer!

Since I stopped the Apache jails yesterday, disk utilization has returned to pre-12.5 levels. See diskstats_utilization-week.png, where the impressive increase starts exactly at the upgrade and lasts until yesterday, when I stopped the jails. So it cannot be the coincidental case of an attack.

Fail2Ban monitors logs, and by simply shutting down jails one only reduces the number of reads (not writes).
Your graphs do not make any distinction between reads and writes, so it is difficult to come to any conclusion from those graphs.

About the reads and writes, I agree with you in theory, but see the diskstats_iops-week.png graph: after the upgrade and before stopping the jails, both reads and writes increased, and paradoxically writes increased much more than reads, which doesn't make much sense...

About the logs: there are 200 Apache log files ranging from a few KiB to ~70 MiB (I just re-checked), so they are far from huge.

And yet another strange thing: on the other server, where I still have the jails active because it hosts fewer sites, lsof does not show the sites' access_log files open for reading by fail2ban.

So, how could I investigate what happens with fail2ban?
 

Attachments

  • diskstats_utilization-week.png (31.7 KB)
  • diskstats_iops-week.png (47.7 KB)
@Zaxos,

Some general remarks have to be made in response to your post; for the sake of convenience, I will quote parts of your post and comment on them.

Since I stopped the Apache jails yesterday, disk utilization has returned to pre-12.5 levels. See diskstats_utilization-week.png, where the impressive increase starts exactly at the upgrade and lasts until yesterday, when I stopped the jails. So it cannot be the coincidental case of an attack.

Too soon, much too soon, to conclude this: deactivating Fail2Ban jails does not lift existing bans, which in turn implies that some malicious IPs (i.e. those that remain banned even though the jail is deactivated) still cannot "enter the system" for some time.

In short, only after some days can one draw conclusions about the effects of deactivating a Fail2Ban jail.

About the reads and writes, I agree with you in theory, but see the diskstats_iops-week.png graph: after the upgrade and before stopping the jails, both reads and writes increased, and paradoxically writes increased much more than reads, which doesn't make much sense...

Ehm, it "makes sense", in the sense that some normal disk activity (with fairly normal spikes) is exhibited in your graph.

By the way, all the monitoring also causes spikes, so any conclusion would be a little bit biased.

About the logs: there are 200 Apache log files ranging from a few KiB to ~70 MiB (I just re-checked), so they are far from huge.

Which Apache log files? A count of 200 should not really be the case, so I am afraid I am misunderstanding this part of your response.

Anyway, for a relatively new server, these logs are in fact fairly huge and, moreover, it is not about size: the contents of those log files matter.

For instance, 10,000 HTTP requests per month from a large number of unique IPs is not a problem. However, a high number of repeated requests from the same IP can be a reason to worry, and attempts to access specific ports or pages (i.e. PHP scripts) from a huge number of IPs in a short time can be a reason to fear a distributed attack (of some kind).
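To illustrate, a quick way to judge the contents of a log (rather than its size) is to count requests per IP. A small sketch, with made-up sample data standing in for a real vhost log:

```shell
# Sketch: count requests per IP in an access log; the entries and
# path below are illustrative sample data, not real server logs.
log=/tmp/sample-access_log
cat > "$log" <<'EOF'
203.0.113.5 - - [23/Jan/2016:19:42:08 +0200] "GET /wp-login.php HTTP/1.1" 200 -
203.0.113.5 - - [23/Jan/2016:19:42:09 +0200] "POST /wp-login.php HTTP/1.1" 200 -
198.51.100.7 - - [23/Jan/2016:19:42:10 +0200] "GET / HTTP/1.1" 200 -
EOF
# Top talkers first; one IP hammering a login page is the warning sign
awk '{print $1}' "$log" | sort | uniq -c | sort -rn
```

On a real server you would point this at a vhost access log instead of the sample file.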

And yet another strange thing is that at the other server where due to less sites I have the jails still active, I cannot find with lsof the access_log files of the sites being open for read by fail2ban.

Not surprising, lsof is the "wrong command".

You can always have a look at /var/log/fail2ban.log: that way, you can check which domains are actually scanned by Fail2Ban. (Please note that you should not be alarmed when certain entries are absent: particular domains can be suspended or shielded off by firewalls, in which case fail2ban.log contains little or no information about them.)

So, how could I investigate what happens with fail2ban?

In order to analyse the actions of Fail2Ban, have a look at /var/log/fail2ban.log.
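For instance, a one-liner can summarize ban events per jail. A sketch against sample log lines (the exact line format can differ between Fail2Ban versions, so the grep/sed pattern is an assumption to be adjusted; on a real server, point it at /var/log/fail2ban.log):

```shell
# Sketch: summarize Ban events per jail from a fail2ban log.
# The sample lines below stand in for /var/log/fail2ban.log and
# mimic the 0.9-style format (an assumption; adjust if yours differs).
log=/tmp/fail2ban-sample.log
cat > "$log" <<'EOF'
2016-01-23 19:50:01,001 fail2ban.actions [32747]: NOTICE [plesk-apache] Ban 192.0.2.10
2016-01-23 19:51:02,002 fail2ban.actions [32747]: NOTICE [plesk-apache-badbot] Ban 192.0.2.11
2016-01-23 19:52:03,003 fail2ban.actions [32747]: NOTICE [plesk-apache] Ban 192.0.2.12
EOF
# Bans per jail, busiest jail first
grep ' Ban ' "$log" | sed 's/.*\[\(.*\)\] Ban.*/\1/' | sort | uniq -c | sort -rn
```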

In order to analyse the functioning of Fail2Ban, go to "Tools & Settings > IP Address Banning (Fail2Ban)" and have a look at the tabs "Settings" and "Jails" (i.e. click on a jail name and you can have a glance at the setup of the jail) and also have a look at "Jails > Manage filters" (i.e. some other jails are present there).

Note that you can "harden" Fail2Ban jails by lowering the max retry setting and/or increasing the ban period. This increases both Fail2Ban's "power" AND overall server performance, simply because bans occur more frequently and a lot of traffic is rejected during the predefined, longer ban period.
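As a rough sketch of such hardening, a jail.local override might look like the following. The option names (maxretry, bantime) are standard Fail2Ban jail settings, but the values and the jail name are only illustrations to be tuned to your situation:

```ini
# /etc/fail2ban/jail.local -- local overrides; values are examples only
[DEFAULT]
# ban after fewer failures than the default
maxretry = 3
# keep the ban for 24 hours instead of the short default
bantime = 86400

# enable hardening for a specific jail (name as used in this thread)
[plesk-apache-badbot]
enabled = true
```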

In order to get some general understanding of Fail2Ban, its advantages and disadvantages, the do's and don'ts, and so on, just have a look in this forum (I have written some posts about Fail2Ban in the past) OR have a look at the Fail2Ban project page (http://www.fail2ban.org).

In essence, Fail2Ban is just a log scanner, augmented with a script to inject temporary firewall rules into iptables (and the latest version also allows permanent firewall rules).

That is all, it is no rocket science.

In general, nothing interesting happens with Fail2Ban, but I strongly advise having a look at the jails and improving their settings; it can make a huge difference.

Regards....
 
After some time busy with other things (by stopping the Apache jails I had managed to stop clients complaining about the slow server), I thought it was time to investigate again the load that, to me, seems to increase when the Apache jails are activated.

So, today, two hours ago, I re-enabled the apache jails (plesk-apache, plesk-apache-badbot).

Immediately, the effect on the server was more than obvious. Please look at the graphs, especially the disk utilization graph and the load graph (the last two hours are with the jails enabled).

About some remarks of yours:
You wrote
"Which Apache log files? A count of 200 should not really be the case, so I am afraid I am misunderstanding this part of your response."

They're actually 300. The server hosts 100 sites and it's 3 log files for each site.
Example, taken from fail2ban.log (I replaced the name of the actual site with 'xxxx'):
2016-01-23 19:42:08,530 fail2ban.filter [32747]: INFO Added logfile = /var/www/vhosts/system/xxxxx.gr/logs/error_log
2016-01-23 19:43:31,928 fail2ban.filter [32747]: INFO Added logfile = /var/www/vhosts/system/xxxxx.gr/logs/access_log
2016-01-23 19:43:33,970 fail2ban.filter [32747]: INFO Added logfile = /var/www/vhosts/system/xxxxx.gr/logs/access_ssl_log


These are the log files that fail2ban monitors in order to catch, e.g., bots trying malicious things on one of the server's sites. It is strange to me that with 'lsof' I cannot see fail2ban reading these log files. It is not the wrong command, as you wrote: when a file is open by some process, it should appear in lsof.

So, I hope you are now convinced that fail2ban's Apache jails drive the server load up.

And as I mentioned on a previous post, I found this
https://kb.odin.com/en/122407
which describes exactly that:

Cause

Fail2ban has plesk-apache-badbot and plesk-apache (or other big) jails enabled. That jail forces Fail2ban to parse all the access and error logs for each virtual host and Apache's access log.


If there are a lot of virtual host access logs, the service hangs as a result of resource overusage when trying to parse them.


NOTE: When you enable this jail in Plesk, you may see the warning:


Warning: Fail2Ban might not work well if there are many domains and Fail2Ban has to monitor too many log files.
 

Attachments

  • diskstats_utilization-day.png (31.7 KB)
  • load-day.png (27.9 KB)
  • diskstats_iops-day.png (34.3 KB)
  • diskstats_throughput-day.png (41.7 KB)
@Zaxos Dogganos

In essence, you are right to some degree, although there are also some errors in your line of reasoning.

I am not writing this post to start or restart a discussion, it is just intended as a guideline to improvements in your hosting environment.

In fact, there are two (main) factors contributing to your issue:

a) Fail2Ban does not handle regexps very well (understatement of the century), implying that every wildcard causes Fail2Ban to do a lot of "work", which results in resource usage, (and)

b) Apache can be very verbose, depending on

- the general settings you have chosen for logging, and/or
- settings for Apache restarts (for instance, setting the restart interval to 0 will cause huge logs, certainly in the case of multiple domains), and/or
- many other factors, including those related to the OS system settings,

and so on. Any verbosity of Apache will increase the need for Fail2Ban to use a lot of resources.

Again, Fail2Ban does not handle regexps very well; this also applies to the process of reading the logs, (and)

c) Apache in the Plesk stack uses a lot of separate log files, increasing the probability that Fail2Ban does use a lot of resources, (and)

d) Plesk ships an Apache + Nginx stack, and not using Nginx (or not using it efficiently) results in more (Apache and other) log output than should be necessary.

The above is not an exhaustive summary; it is a rough outline of some relevant factors.

In short, you can simply improve the hosting environment by:

- putting Nginx in front of Apache, hence decreasing the quantity of output in all the different log files,
- adding directives in Nginx to stop certain traffic, such as bots or even bad bots,
- serving static files from Nginx and/or using caching mechanisms, in order to reduce the number of (direct) requests to Apache,
- removing any Nginx related jails in Fail2Ban, hence reducing the workload of Fail2Ban (if most of the traffic ends up being served by Nginx),
- removing wildcards and any other regexps from Fail2Ban jails and filters, if and whenever possible,

and so on.

In conclusion, it is not a problem with or caused by Fail2Ban or Apache, but in essence a problem of general misconfiguration that does not take into account the specific quirks of Fail2Ban, Apache or whatever element of the Plesk stack is being used.

As a final example, it is fairly easy to copy the "tails" of log files into one central log file, which can then be used by Fail2Ban to process IP bans.

The last suggestion should help a bit: one "relatively small" log file containing all relevant entries (i.e. relevant for Fail2Ban processing) should reduce Fail2Ban's resource usage.
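A minimal sketch of that "central log" idea, with throw-away /tmp paths standing in for the real Plesk locations (which live under /var/www/vhosts/system/<domain>/logs/), and a one-shot copy standing in for a persistent tail process:

```shell
# Sketch: gather vhost log lines into one central file that a single
# Fail2Ban jail could watch.  Throw-away paths and fake entries stand
# in for the real logs; in production you would run something like
#   tail -F /var/www/vhosts/system/*/logs/access_log >> /var/log/central_access.log &
mkdir -p /tmp/vhost-logs
echo '192.0.2.1 - - "GET /wp-login.php"' > /tmp/vhost-logs/site1-access_log
echo '192.0.2.2 - - "GET /xmlrpc.php"'  > /tmp/vhost-logs/site2-access_log
# One-shot aggregation into the central file
cat /tmp/vhost-logs/*access_log >> /tmp/central_access.log
wc -l < /tmp/central_access.log
```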

Hope that all the above helps to find a solution for your problem with Fail2Ban.

Regards.....
 
Thank you for your general guidelines, but the problem has been solved today.

I tried the solution from https://kb.odin.com/en/122407, which basically says that you should split the hundreds of log files that fail2ban has to examine across many jails. This really fixed things, but it makes a mess with the dozens of new jails that have to be maintained.
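For the record, that workaround amounts to something like the following. The jail names, the filter name and the glob patterns here are only illustrative; the real Plesk jail definitions may differ:

```ini
# Sketch of the KB workaround: several small jails, each watching a
# subset of the vhost logs, instead of one jail watching all of them.
[plesk-apache-group1]
enabled = true
filter  = plesk-apache
logpath = /var/www/vhosts/system/[a-m]*/logs/error_log

[plesk-apache-group2]
enabled = true
filter  = plesk-apache
logpath = /var/www/vhosts/system/[n-z]*/logs/error_log
```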

I found a better solution which completely solves the disk utilization problem and subsequently the load problem (without making any other changes to the jails). I performed a system audit and traced the system calls writing to the disk. I found that there was one file that was written and read hundreds of times per second: /var/lib/fail2ban/fail2ban.sqlite3
This is (as I understand it) where fail2ban persists the information read from the log files, so that it can identify attackers and block them. So I searched fail2ban's config files and found this option:

# Options: dbfile
# Notes.: Set the file for the fail2ban persistent data to be stored.
# A value of ":memory:" means database is only stored in memory
# and data is lost when fail2ban is stopped.
# A value of "None" disables the database.
# Values: [ None :memory: FILE ] Default: /var/lib/fail2ban/fail2ban.sqlite3
dbfile = /var/lib/fail2ban/fail2ban.sqlite3

As soon as I changed this to
dbfile = :memory:

all was quiet on the server again. CASE CLOSED!
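By the way, instead of editing fail2ban.conf in place, the same override can go into a .local file, which fail2ban reads on top of the shipped config, so a package upgrade does not silently revert it:

```ini
# /etc/fail2ban/fail2ban.local -- overrides /etc/fail2ban/fail2ban.conf
[Definition]
dbfile = :memory:
```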

Thanks for all your help and ideas anyway.
 
@Zaxos Dogganos

A general tip, since you are really bound to aggravate the whole Fail2Ban issue by moving its database into memory: simply disable the option for persistent IP bans instead.

At this moment, you are using a solution that is actually not a solution.

Read the following remarks, if you will.

A - Fail2Ban and challenges of memory usage

Your memory will be used to the same extent or degree as your disk has been used or overused, with one problem though: any memory failure will shut down the whole system.

Consider your system being under some kind of distributed attack, with a lot of IPs causing traffic to your server: your server will store them (more or less) in memory and, in the long run and/or under heavy loads, your memory is exhausted.

In cases of heavy attacks, the irony is that with the option ":memory:" Fail2Ban is the first program to shut down and/or fail, i.e. it really becomes (literally) Fail 2 Ban (any) IP.

B - Memory usage and common VPS

Most memory of VPS is actually some kind of swap: regular swap, or some kind of advanced sharing of memory on the host system.

In general, providers do not tell you that swap is a large part of memory: you think you have (for instance) 12GB, but this is often 8GB (shared!) memory and 4 GB swap.

If the system is heavily depending on swap, loading into "memory" in cases of attacks and/or heavy attacks will result in using the (very slow) swap, which is a sort of designated disk.

Moreover, in many cases of attacks, any form of heavy usage of shared memory on a VPS will often result in the host system cutting down on memory for the VPS or a problem in the host system itself, leaving the VPS with actual problems.

In short, any assignment of IP bans to memory will lead to problems and/or even the usage of swap, which is not "common memory" (making the ":memory:" option pointless).

Note that there are many attack scripts, specifically designed to attack vulnerable VPSes in order to disable the host system and primarily the security measures on the host system.

C - Fail2Ban design and persistent IP bans

Fail2Ban uses the SQLite database to create "persistent" IP bans, i.e. bans that are persistent across reboots.

The ":memory:" option does not really create persistent IP bans: every reboot, memory is flushed.

As such, any usage of memory for IP banning is somewhat "strange", and at least not advisable.

D - Fail2Ban performance and factual quality

Fail2Ban has been started as an admirable project and it still is, but with the introduction of the "persistent" IP bans, all other bugs and bug fixes were more or less forgotten.

This trend also applies to the SQLite database approach to persistence: it is not the best approach, for many reasons (and patches are not coming any time soon).

Moreover, even given a specific level of quality of the code behind Fail2Ban, Fail2Ban performs AND behaves poorly when using the ":memory:" option.

In short, it is one of the many options associated with Fail2Ban, that should not be used.

A number of similar remarks can be made with respect to other Fail2Ban options, such as the option to "connect" with one or more blacklists: it is possible, but not preferable.

E - Conclusion

Everybody can do a lot with Fail2Ban, but the most intricate solutions are not the domain of Fail2Ban: it is a crude solution, a "second line of defense" solution.

After all, malicious traffic should simply not enter the system and cause lines in the various log files scanned by Fail2Ban.

In conclusion, the emphasis should be more on a proper firewall setup, only followed by a proper setup of the second line of defense offered by Fail2Ban.

And yes, Fail2Ban injects rules into the iptables rule set, but it only does so after entry has been granted by the absence of other firewall rules.

One should simply be one step ahead: block malicious IPs BEFORE any entry (i.e. traffic) has been allowed.

Regards....
 