Question Training SpamAssassin on Linux

philglau · Jun 6, 2023

So the support page "How to train SpamAssassin on Plesk server" indicates that we should select the "Move spam to the Spam folder" option rather than the "mark spam with ***SPAM***" option.

Is this required for Plesk to train SpamAssassin automatically? Right now our preference is to open the inbox and manually mark emails as spam (which sends them to the Spam folder), so that we don't miss any important emails that might get moved to Spam as false positives.

In a user's /cur/ folder I can see files like: '1685978160.M698294P2110691.my-server.com,S=9999,W=10130:2,Sa'

I've also tried the sa-learn utility on a bunch of spam messages in a user's mailbox, as directed in the article linked above. However, after running sa-learn, nothing appears in /root/.spamassassin as the article says it should. sa-learn does do something: it runs for a bit and then I get a message like "Learned tokens from 171 message(s) (282 message(s) examined)", so it appears to be working, just not in the way the Plesk article suggests. (Basically, Steps 4 and 5 seem to be either incorrect or not working as expected on my server.)

Thank you in advance for any help you can provide.

Kaspar · Jun 6, 2023

philglau said:
So the support page "How to train SpamAssassin on Plesk server" indicates that we should select the "Move spam to the Spam folder" option rather than the "mark spam with ***SPAM***" option.

Is this required for Plesk to train SpamAssassin automatically? Right now our preference is to open the inbox and manually mark emails as spam (which sends them to the Spam folder), so that we don't miss any important emails that might get moved to Spam as false positives.

By default, Plesk runs a daily script that trains the SpamAssassin Bayes filter with the messages in the spam folder of each mailbox. Having spam messages moved to the spam folder automatically makes this process easier, but it isn't required. Moving messages to the spam folder manually works just as well.

philglau said:
In a user's /cur/ folder I can see files like: '1685978160.M698294P2110691.my-server.com,S=9999,W=10130:2,Sa'

I've also tried the sa-learn utility on a bunch of spam messages in a user's mailbox, as directed in the article linked above. However, after running sa-learn, nothing appears in /root/.spamassassin as the article says it should. sa-learn does do something: it runs for a bit and then I get a message like "Learned tokens from 171 message(s) (282 message(s) examined)", so it appears to be working, just not in the way the Plesk article suggests. (Basically, Steps 4 and 5 seem to be either incorrect or not working as expected on my server.)

I am not really sure why nothing appears in your /root/.spamassassin directory when you run the sa-learn utility. However, it should not be necessary to run sa-learn manually if you move your spam messages to the spam folder. The spam filter will learn from the messages in the spam folder anyway. (It won't learn twice from the same message.) If for some reason you want to train the spam filter manually, it's better to run the plesk daily ExecuteSpamtrain command.

Also note that the Bayes spam filter isn't something magical that will recognize all spam messages after a while. It won't, especially when a separate database is used for each mailbox (the sample size is usually too small to have a meaningful impact). The Bayes spam filter is just one of the many methods that SpamAssassin can use to determine the spam score of messages.

philglau · Jun 6, 2023

Thank you for your response.

When I run sa-learn, I see that the mariadb process is simultaneously running under load, so I'm assuming that sa-learn is updating the database. I ran plesk daily ExecuteSpamtrain (prior to doing so), but it seemed to finish in an instant, whereas sa-learn took about 30 to 60 seconds to process ~650 spam messages in one of my user folders.

What do you recommend (as add-ons?) to help control spam on a Plesk server? I appreciate it's not going to be 'Google quality' spam control, but I suspect there must be something slightly better than just Bayesian SpamAssassin. We used to use Google Workspace, but the per-user cost was too much for our small business.

Kaspar · Jun 7, 2023

philglau said:
When I run sa-learn, I see that the mariadb process is simultaneously running under load, so I'm assuming that sa-learn is updating the database. I ran plesk daily ExecuteSpamtrain (prior to doing so), but it seemed to finish in an instant, whereas sa-learn took about 30 to 60 seconds to process ~650 spam messages in one of my user folders.

The way the Bayes spam filter works in Plesk is that every mailbox has a separate Bayes database. By default, these databases are stored in files (named bayes_toks and bayes_seen). Bayes can use MySQL or Redis, but only if you have specifically configured it that way. (So it is probably a coincidence that you see mariadb running under load at the same time.) In those databases, tokens derived from each email are stored (based on an algorithm). That way each email can be uniquely identified and doesn't get learned from twice.
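To see whether a given directory already holds such a file-based database, a minimal sketch like the one below might help. The function name has_bayes_db is made up for illustration, and the example path assumes the mailbox layout that appears in the commands in this thread.

```shell
# Hypothetical helper: succeed if a directory contains both Bayes files.
# bayes_toks and bayes_seen are the file names mentioned above.
has_bayes_db() {
    dir="$1"
    [ -f "$dir/bayes_toks" ] && [ -f "$dir/bayes_seen" ]
}

# Example (path layout is an assumption taken from this thread):
# has_bayes_db /var/qmail/mailnames/example.com/johndoe/.spamassassin \
#     && echo "file-based Bayes database found"
```

If the check fails for every mailbox, that may be a hint that Bayes data is stored somewhere else (for example in a SQL backend).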

This is where it gets a bit complicated. Bear with me. When you use the sa-learn utility as suggested in the Plesk support article (# sa-learn --spam /var/qmail/mailnames/example.com/johndoe/Maildir/.Spam/cur/), a completely new Bayes database gets created in the home directory of the user running the command the first time you run it. Because the Bayes database is new and empty, it takes 30 to 60 seconds to process your ~650 spam messages and add their tokens to the database. If you run it again it will finish much faster (instantly), because there are no new messages left to learn from.

Now, to be honest, the steps for using the sa-learn utility in the Plesk support article make the whole process a bit more complicated than it has to be (imho), because a new Bayes database gets created and you then have to copy it to the Bayes database directory of the mailbox user. A much better command for using the sa-learn utility would be:

Code:

sudo -u popuser sa-learn --spam --dbpath /var/qmail/mailnames/example.com/johndoe/.spamassassin /var/qmail/mailnames/example.com/johndoe/Maildir/.Spam/cur/

That way the sa-learn utility runs as the popuser user (the file owner of the Bayes database) and uses the existing Bayes database (located at /var/qmail/mailnames/example.com/johndoe/.spamassassin). However, this is basically the same as running plesk daily ExecuteSpamtrain, with the notable exception that plesk daily ExecuteSpamtrain runs through every mailbox instead of just a single one.
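If you need this for several mailboxes, the command can be derived from the email address. The following is only a sketch: train_cmd_for is a hypothetical helper name, it assumes the /var/qmail/mailnames/<domain>/<user> layout used above, and it only prints the command so you can review it before running it.

```shell
# Hypothetical helper: print the per-mailbox sa-learn command for an address.
# Assumes the layout /var/qmail/mailnames/<domain>/<user> used in this thread.
train_cmd_for() {
    addr="$1"
    user="${addr%@*}"      # part before the @
    domain="${addr#*@}"    # part after the @
    base="/var/qmail/mailnames/$domain/$user"
    printf 'sudo -u popuser sa-learn --spam --dbpath %s %s\n' \
        "$base/.spamassassin" "$base/Maildir/.Spam/cur/"
}

# Usage: prints the command shown above for this mailbox.
# train_cmd_for johndoe@example.com
```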

If you want to know whether new messages were learned from and added to the database after you have run either the sa-learn utility or plesk daily ExecuteSpamtrain, you can run sa-learn --dump magic --dbpath /var/qmail/mailnames/example.com/johndoe/.spamassassin. This will output the number of spam and ham emails that have been added to the database, along with some other info.
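To pull just the two counters out of that output, something like the sketch below could be used. It assumes the usual dump layout, in which the value sits in the third column and the line ends with "non-token data: nspam" or "non-token data: nham"; bayes_counts is a made-up function name.

```shell
# Hypothetical helper: read `sa-learn --dump magic` output on stdin and
# print the number of learned spam and ham messages.
# Assumption: the value is in column 3 of the nspam/nham lines.
bayes_counts() {
    awk '
        /non-token data: nspam/ { spam = $3 }
        /non-token data: nham/  { ham = $3 }
        END { printf "spam=%d ham=%d\n", spam, ham }
    '
}

# Usage (run as a user that can read the database):
# sa-learn --dump magic --dbpath /var/qmail/mailnames/example.com/johndoe/.spamassassin | bayes_counts
```

Running it before and after a training run and comparing the numbers shows whether anything new was learned.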

philglau said:
What do you recommend (as add-ons?) to help control spam on a Plesk server? I appreciate it's not going to be 'Google quality' spam control, but I suspect there must be something slightly better than just Bayesian SpamAssassin. We used to use Google Workspace, but the per-user cost was too much for our small business.

In my experience SpamAssassin can work reasonably well, but it takes some knowledge, effort (and time) to set it up so that it catches most spam. Things that help SpamAssassin catch more spam are: configuring DNSBLs and installing Pyzor, Razor and DCC. If you receive a fair amount of spam in foreign languages, the TextCat plugin can be useful too. I also prefer to use a centralized Bayes database instead of separate Bayes databases for each mailbox, which increases the sample size of the database (but that's just a minor improvement).

Depending on your server skills you can do this yourself (with the help of tutorials and articles you can find online). I am also a great fan of the Danami Warden Antispam extension, which allows you to do all of this and much more with a simple interface in Plesk. It's not a free extension, but worth the price in my eyes.

philglau · Jun 8, 2023

I dug into our settings. On our server we are running 'Plesk Email Security' (free), and it appears that this extension uses MySQL.

On our database server I see the database 'emailsecurity' and in it are several tables including:

Code:

| bayes_expire            |
| bayes_global_vars       |
| bayes_seen              |
| bayes_token             |
| bayes_vars              |


MariaDB [emailsecurity]> select count(id) from bayes_token;

+-----------+
| count(id) |
+-----------+
|    560443 |
+-----------+

And when I run 'sa-learn' on a mailbox the count increases as it processes emails, whereas running 'plesk daily ExecuteSpamtrain' doesn't seem to make any changes to those tables.

I suspect part of my issue is what the free version includes:
Features of the free version:

Configurable anti-spam filter (incoming/outgoing)
Server-wide and individual anti-spam settings (white-/blacklist handling, marking spam and sensitivity)
Email configuration checker (DNS/RDNS records, MX-settings, Ports)
Settings migration from built-in antispam

But it appears that 'paid' features are:

Features of the paid version:

Anti-virus scanning of emails
Anti-virus quarantine management
Daily updates of the anti-virus database
Automatic learning of spam and ham messages via actions in the email client (mark as spam/not spam)
Daily updates of the anti-spam database
Detailed statistics overview of the email traffic (ham, spam, viruses)
DNS blacklist management

So the free version specifically excludes auto-learning, daily updates of the anti-spam database, and DNSBL management?

That probably explains why it only seems to 'learn' when I manually run sa-learn against the user folders, while plesk daily ExecuteSpamtrain doesn't appear to do anything (I'm not 100% sure of this claim).

It seems like the 'Plesk Email Security' extension may actually be worse than the default Plesk install, where 'ExecuteSpamtrain' or sa-learn work automatically?

Kaspar · Jun 9, 2023

philglau said:
[...] On our server we are running 'Plesk Email Security' (free) and it appears that this extension uses MySQL. [...]

Ah, right, that explains a lot. Plesk Email Security (PES) takes a different approach to SpamAssassin. My last post and the Plesk support article you referenced in your first post only apply when using the vanilla (default) SpamAssassin setup in Plesk. So forget about all that when using PES.

I have no experience with PES, so I am not sure how it uses DNSBLs and trains or auto-learns from spam. However, based on your posts it looks like running the sa-learn utility manually does work with PES (in that the count in the database increases). So that's a good thing. Even though the free PES version might not automatically 'learn' from spam messages, you can probably still make it learn by creating a cron job that runs daily and executes the sa-learn command. The cool thing about the sa-learn utility is that it supports wildcards in the mailbox paths, so you could do sa-learn --spam /var/qmail/mailnames/*/*/Maildir/.Spam/cur/ (which will likely take some time to finish).
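A daily job along those lines could look like the sketch below. It is untested against PES and rests on a few assumptions: that plain sa-learn writes to the global database (as the increasing token count suggests), that spam folders are named .Spam, and that mailboxes live under /var/qmail/mailnames. MAIL_ROOT and DRY_RUN are illustrative variable names, not Plesk settings. By default the script only prints what it would run.

```shell
#!/bin/sh
# Sketch of a daily spam-training job. MAIL_ROOT and DRY_RUN are
# illustrative names introduced here, not Plesk or SpamAssassin settings.
MAIL_ROOT="${MAIL_ROOT:-/var/qmail/mailnames}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run sa-learn

train_all_spam_folders() {
    for dir in "$MAIL_ROOT"/*/*/Maildir/.Spam/cur/; do
        [ -d "$dir" ] || continue   # the glob matched nothing
        if [ "$DRY_RUN" = "1" ]; then
            echo "sa-learn --spam $dir"
        else
            sa-learn --spam "$dir"
        fi
    done
}

train_all_spam_folders
```

Once the printed commands look right, setting DRY_RUN=0 in the cron entry would make it actually train. On a default (non-PES) setup with per-mailbox databases, each call would additionally need the --dbpath option shown earlier in the thread.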

tramvainqueur · Aug 15, 2023

Kaspar said:
Ah, right, that explains a lot. Plesk Email Security (PES) takes a different approach to SpamAssassin. My last post and the Plesk support article you referenced in your first post only apply when using the vanilla (default) SpamAssassin setup in Plesk. So forget about all that when using PES.

I have no experience with PES, so I am not sure how it uses DNSBLs and trains or auto-learns from spam. However, based on your posts it looks like running the sa-learn utility manually does work with PES (in that the count in the database increases). So that's a good thing. Even though the free PES version might not automatically 'learn' from spam messages, you can probably still make it learn by creating a cron job that runs daily and executes the sa-learn command. The cool thing about the sa-learn utility is that it supports wildcards in the mailbox paths, so you could do sa-learn --spam /var/qmail/mailnames/*/*/Maildir/.Spam/cur/ (which will likely take some time to finish).

Thank you @Kaspar. Your hint on how to get SpamAssassin learning automatically again, now that this feature has unexpectedly been removed, is worth a great deal. Plesk's page "Enable SpamAssassin Spam Filter in Plesk" says that manually training SpamAssassin is not useless when PES is activated.

I do not know exactly what Plesk allows on the server and what it overwrites after a reboot or some other event. So I assume a safe way to implement a cron job is to do it the Plesk way. Following Plesk's page "Integrate cron jobs with Plesk" should ensure that the cron job is never lost once added. One can even execute the command

Bash:

sa-learn --spam /var/qmail/mailnames/*/*/Maildir/.Spam/cur/

immediately to verify that it works as desired. That way you do not have to look up how to integrate cron jobs or how the cron job has to be written. You can also choose how often it is executed, and even be notified after every execution or only after failed executions.

FuXXz · Jan 20, 2024

I'm currently having trouble getting SpamAssassin to learn spam, so I'm still looking for solutions.
To avoid any misunderstandings: Plesk maintains a Bayes database for each mailbox?
But if I use # sa-learn --spam /var/qmail/mailnames/example.com/johndoe/Maildir/.Spam/cur/
there are no changes to the files in the corresponding folder:
/var/qmail/mailnames/example.com/johndoe/.spamassassin.
Only the files in the /root/.spamassassin folder are changed.
Does this description say that I have to copy the files?
https://support.plesk.com/hc/en-us/articles/12377563742231
Am I misunderstanding something? Please explain.

And if I want to learn from a spam archive,
do I have to configure each email account individually? Or is there a central database?

Kaspar · Jan 20, 2024

FuXXz said:
I'm currently having trouble getting SpamAssassin to learn spam, so I'm still looking for solutions.
To avoid any misunderstandings: Plesk maintains a Bayes database for each mailbox?

Yes, unless you are using a custom SpamAssassin configuration or the Plesk Email Security extension, in which case a global Bayes database is used instead.

FuXXz said:
But if I use # sa-learn --spam /var/qmail/mailnames/example.com/johndoe/Maildir/.Spam/cur/
there are no changes to the files in the corresponding folder:
/var/qmail/mailnames/example.com/johndoe/.spamassassin.
Only the files in the /root/.spamassassin folder are changed.
Does this description say that I have to copy the files?
https://support.plesk.com/hc/en-us/articles/12377563742231
Am I misunderstanding something? Please explain.

Run this command instead; it's much easier. There is no need to copy any files if you use it.

Code:

sudo -u popuser sa-learn --spam --dbpath /var/qmail/mailnames/example.com/johndoe/.spamassassin /var/qmail/mailnames/example.com/johndoe/Maildir/.Spam/cur/

Or run this command if you want to train the Bayes database of every mailbox.

Code:

plesk daily ExecuteSpamtrain

FuXXz said:
And if I want to learn from a spam archive,
do I have to configure each email account individually? Or is there a central database?

You'll have to do that manually for each mailbox.

The default SpamAssassin configuration in Plesk does not use a global Bayes database.
