Issue Plesk Email Security seems not to learn spam properly

ciB · Feb 12, 2025

Since we switched to the paid version, we have been receiving a lot more spam. I knew it might take some time for the training to take effect, but unfortunately it only seems to be getting worse.

I have included some headers that were detected as spam and some not - the ones that haven't been detected were moved to/marked as spam.

Not detected as spam:

#1 MXToolbox Header Analysis
#2 MXToolbox Header Analysis

What I find curious is that the messages are marked as ham - how come? They have never been moved back to the inbox.

Detected as spam:

#3 MXToolbox Header Analysis

Also, bayes database looks like this:

Any help would be highly appreciated.

AvaDev · Feb 13, 2025

We can confirm this.

Kaspar · Feb 13, 2025

What's the output of:

Code:

sudo sa-learn -D --dump magic

ciB · Feb 13, 2025

Kaspar said:
What's the output of:

Code:

sudo sa-learn -D --dump magic

Pastebin

Kaspar · Feb 13, 2025

ciB said:
Feb 13 15:18:24.317 [1459183] dbg: plugin: Mail::SpamAssassin::Plugin::Bayes=HASH(0x6122f2903230) implements 'learner_new', priority 0
Feb 13 15:18:24.317 [1459183] dbg: bayes: learner_new self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x6122f2903230),bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL
Feb 13 15:18:24.332 [1459183] dbg: bayes: using username: amavis
Feb 13 15:18:24.332 [1459183] dbg: bayes: learner_new: got store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x6122f3f63ae8)
Feb 13 15:18:24.332 [1459183] dbg: plugin: Mail::SpamAssassin::Plugin::Bayes=HASH(0x6122f2903230) implements 'learner_is_scan_available', priority 0

Looks good. The Bayes database seems to be correctly configured.

Could you export the message from your previous post that did get marked as spam (the entire message, not just the headers) to a directory somewhere on your server and run the learn command on it:
sudo sa-learn --spam /tmp/spam-test/spam.eml

Does that show the message as learned from?
Learned tokens from 1 message(s) (1 message(s) examined)
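
If you want to check that summary line from a script rather than by eye, here is a minimal Python sketch. It assumes only the summary format quoted above; the function names `parse_sa_learn_summary` and `already_known` are made up for illustration and are not part of SpamAssassin.

```python
import re

# Matches the summary line quoted above, e.g.
# "Learned tokens from 1 message(s) (1 message(s) examined)"
SUMMARY = re.compile(
    r"Learned tokens from (\d+) message\(s\) \((\d+) message\(s\) examined\)"
)


def parse_sa_learn_summary(output):
    """Return (learned, examined) taken from the summary line."""
    match = SUMMARY.search(output)
    if match is None:
        raise ValueError("no sa-learn summary line found")
    return int(match.group(1)), int(match.group(2))


def already_known(output):
    """True when messages were examined but none produced new learning."""
    learned, examined = parse_sa_learn_summary(output)
    return examined > 0 and learned == 0


line = "Learned tokens from 1 message(s) (1 message(s) examined)"
print(parse_sa_learn_summary(line))  # (1, 1)
print(already_known(line))           # False
```

A result of (0, 1) would suggest the message had already been learned from earlier.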

ciB · Feb 19, 2025

Kaspar said:
Looks good. The Bayes database seems to be correctly configured.

Could you export the message from your previous post that did get marked as spam (the entire message, not just the headers) to a directory somewhere on your server and run the learn command on it:
sudo sa-learn --spam /tmp/spam-test/spam.eml

Does that show the message as learned from?
Learned tokens from 1 message(s) (1 message(s) examined)

Sorry for taking a bit longer - kids sick, now me sick..
It does show the message.

Kaspar · Feb 19, 2025

ciB said:
Sorry for taking a bit longer - kids sick, now me sick..

Hope you get well soon.

ciB said:
It does show the message.

That's odd. I would have expected it to return "Learned tokens from 0 message(s)", as that would mean the message had already been learned from. It looks like the message was recognized as spam (its spam score is higher than the spam score threshold), but somehow was not learned from as spam.

With SpamAssassin, messages can be learned from in two different ways:

The first is directly upon receiving an email. This is called auto learning. A message is only auto learned from if its spam score hits one of the autolearn thresholds. For spam the default minimum threshold score is 12 and for ham the default maximum threshold is 0.1. For auto learning purposes, anything scoring below 0.1 is treated as ham. These score thresholds can be configured in SpamAssassin with the bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam settings (more info here).
The second method is learning from messages in the spam folder. A cron job is (generally) run once a day to learn from all new messages in the spam folder. (Every message is learned from only once.)
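
To make the threshold logic concrete, here is a simplified Python sketch. It only models the comparison described above, using the default values 0.1 and 12; the real decision in SpamAssassin involves more conditions than a single score comparison, so treat `autolearn_decision` as an illustrative helper, not SpamAssassin's actual code.

```python
# Default thresholds as stated above; both are configurable.
THRESHOLD_NONSPAM = 0.1   # bayes_auto_learn_threshold_nonspam
THRESHOLD_SPAM = 12.0     # bayes_auto_learn_threshold_spam


def autolearn_decision(score, nonspam=THRESHOLD_NONSPAM, spam=THRESHOLD_SPAM):
    """Simplified model: compare one score against both thresholds."""
    if score < nonspam:
        return "ham"    # low enough to be learned from as ham
    if score >= spam:
        return "spam"   # high enough to be learned from as spam
    return "no"         # in between: not auto learned from


print(autolearn_decision(-2.083))  # ham
print(autolearn_decision(5.0))     # no
print(autolearn_decision(15.14))   # spam (in this simplified model)
```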

I don't have access to the paid version of Plesk Email Security, so I am not sure what method exactly gets used to learn from spam messages. I assume there is a (nightly) cronjob running, besides the auto learning method.

Let's look at the headers of your email examples to get a better picture of why these messages are marked as spam and ham.

This message has a spam score of -2.083. Because this is below the default bayes_auto_learn_threshold_nonspam value of 0.1, this message is auto learned from as ham. Let's have a look at the spam test results to see why it got a negative spam score. The X-Spam-Status header of the email lists every test that returned a score, together with that score:

BAYES_00=-1.9,
DKIM_SIGNED=0.1,
DKIM_VALID=-0.1,
DKIM_VALID_AU=-0.1,
DKIM_VALID_EF=-0.1,
HTML_FONT_LOW_CONTRAST=0.001,
HTML_MESSAGE=0.001,
RCVD_IN_DNSWL_BLOCKED=0.001,
RCVD_IN_VALIDITY_CERTIFIED_BLOCKED=0.001,
RCVD_IN_VALIDITY_RPBL_BLOCKED=0.001,
RCVD_IN_VALIDITY_SAFE_BLOCKED=0.001,
T_REMOTE_IMAGE=0.01,
URIBL_BLOCKED=0.001

Except for the first test (BAYES_00), the test scores are pretty insignificant. The BAYES_00 test means that the Bayes learning database puts the spam probability at 0 to 1%. In other words, too few similar messages have been learned from as spam. Since none of the other tests recognizes the message as spam, the message gets a negative score. It's a bit of a self-reinforcing feedback loop: because of the negative spam score, the message gets auto learned from as ham again (indicated by the autolearn=ham marker).

For comparison, let's look at the spam headers of the other message, the one that did get recognized as spam. This message has a spam score of 15.14, which it got from these spam tests:
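
The arithmetic can be checked with a few lines of Python. This is just a sketch that parses the NAME=score list above and sums it; `parse_tests` is a made-up helper name.

```python
def parse_tests(tests):
    """Turn 'NAME=score, NAME=score, ...' into a {name: score} dict."""
    scores = {}
    for item in tests.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition("=")
        scores[name] = float(value)
    return scores


# Test results copied from the list above.
HAM_TESTS = """
BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
DKIM_VALID_EF=-0.1, HTML_FONT_LOW_CONTRAST=0.001, HTML_MESSAGE=0.001,
RCVD_IN_DNSWL_BLOCKED=0.001, RCVD_IN_VALIDITY_CERTIFIED_BLOCKED=0.001,
RCVD_IN_VALIDITY_RPBL_BLOCKED=0.001, RCVD_IN_VALIDITY_SAFE_BLOCKED=0.001,
T_REMOTE_IMAGE=0.01, URIBL_BLOCKED=0.001
"""

scores = parse_tests(HAM_TESTS)
total = round(sum(scores.values()), 3)
without_bayes = round(total - scores["BAYES_00"], 3)

print(total)          # -2.083
print(without_bayes)  # -0.183
```

Almost the entire negative score comes from BAYES_00; without it the message would sit at about -0.18.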

BAYES_00=-1.9,
DKIM_ADSP_NXDOMAIN=0.9,
DOS_BODY_HIGH_NO_MID=3.999,
HTML_FONT_LOW_CONTRAST=0.001,
HTML_IMAGE_ONLY_20=1.546,
HTML_MESSAGE=0.001,
HTML_MIME_NO_HTML_TAG=0.377,
MIME_HTML_ONLY=0.1,
MIME_HTML_ONLY_MULTI=0.001,
MISSING_DATE=1.36,
MISSING_MID=0.497,
MPART_ALT_DIFF=0.79,
NORDNS_LOW_CONTRAST=0.001,
RCVD_IN_DNSWL_BLOCKED=0.001,
RCVD_IN_PBL=3.335,
RCVD_IN_SBL_CSS=3.335,
RCVD_IN_VALIDITY_CERTIFIED_BLOCKED=0.001,
RCVD_IN_VALIDITY_RPBL_BLOCKED=0.001,
RCVD_IN_VALIDITY_SAFE_BLOCKED=0.001,
RDNS_NONE=0.793,
URIBL_BLOCKED=0.001

You'll notice right away that this list is much longer. Compared to the other email, many more tests returned a score for this message. Several of them return a pretty high score, adding up to a high total spam score. Now, the spam score of 15.14 is higher than the default minimum threshold of 12 for auto learning, yet the autolearn marker (autolearn=no) indicates the message did not get auto learned from. Not sure why; maybe your bayes_auto_learn_threshold_spam setting uses a higher threshold value. It could also be that SpamAssassin does not use the reported score directly for this decision: as far as I know, it recalculates the score without the Bayes rules and, for spam, also requires at least 3 points each from header tests and body tests.

It's generally not a (big) issue if a spam message did not get auto learned from. As long as it is stored in the spam folder of a mailbox, it should get learned from anyway when the (nightly) cron job runs. However, like I said at the beginning of my post, it's odd that this particular message got learned from again when you manually ran the sa-learn utility.

So maybe spam learning is indeed not working as expected, but it is hard to say for sure. Since you bought the paid version, you are eligible for support on this extension and could open a support ticket to have a support engineer analyze your server.

A few general remarks concerning spam filtering I'd like to share:
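
As a quick check of those numbers, here is a small Python sketch that sums the scores listed above and picks out the tests contributing 1.0 or more. The cut-off of 1.0 is an arbitrary choice for illustration.

```python
# Test results copied from the list above.
SPAM_SCORES = {
    "BAYES_00": -1.9,
    "DKIM_ADSP_NXDOMAIN": 0.9,
    "DOS_BODY_HIGH_NO_MID": 3.999,
    "HTML_FONT_LOW_CONTRAST": 0.001,
    "HTML_IMAGE_ONLY_20": 1.546,
    "HTML_MESSAGE": 0.001,
    "HTML_MIME_NO_HTML_TAG": 0.377,
    "MIME_HTML_ONLY": 0.1,
    "MIME_HTML_ONLY_MULTI": 0.001,
    "MISSING_DATE": 1.36,
    "MISSING_MID": 0.497,
    "MPART_ALT_DIFF": 0.79,
    "NORDNS_LOW_CONTRAST": 0.001,
    "RCVD_IN_DNSWL_BLOCKED": 0.001,
    "RCVD_IN_PBL": 3.335,
    "RCVD_IN_SBL_CSS": 3.335,
    "RCVD_IN_VALIDITY_CERTIFIED_BLOCKED": 0.001,
    "RCVD_IN_VALIDITY_RPBL_BLOCKED": 0.001,
    "RCVD_IN_VALIDITY_SAFE_BLOCKED": 0.001,
    "RDNS_NONE": 0.793,
    "URIBL_BLOCKED": 0.001,
}

total = round(sum(SPAM_SCORES.values()), 3)

# Tests that contribute a full point or more (arbitrary cut-off).
heavy = sorted(name for name, score in SPAM_SCORES.items() if score >= 1.0)

print(total)        # 15.141
print(total >= 12)  # True: above the default autolearn spam threshold
print(heavy)
```

Note that BAYES_00 still subtracts 1.9 here, so the Bayes database considered even this message to be ham.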
SpamAssassin evaluates email messages based on certain characteristics with a variety of "tests". Each test can attribute a score to the message, called the spam score. If the spam score of a message hits the spam threshold, the message is considered to be spam. A spam score can be a negative value too, in which case the message (often) is considered to be ham.

One of those many tests is the Bayes learning database. It is, however, just one of the many tests performed to determine whether a message is spam. All these tests evaluate an email on spammy (or hammy) characteristics. The more tests evaluate a message to be spam, the higher the likelihood that it actually is spam. Relying on just one specific test to determine whether a message is spam is not really advisable.

As an analogy, imagine someone robbed a bank. There were many people at the bank witnessing the robbery. In the ensuing chaos the bank robber got away, but the police soon apprehended several suspects near the bank. The witnesses are asked to identify the bank robber from the suspects. The first suspect is identified as the bank robber by only one out of (for example) 10 witnesses. That is a very low score, meaning that the first suspect likely isn't the actual bank robber. If the second suspect is identified by 8 out of 10 witnesses, the likelihood of that suspect being the bank robber is much higher. The same goes for spam: the more "tests" return a spam score, the higher the likelihood of a message actually being spam. (Maybe not the best analogy, but it hopefully gets the point across.)

I see you're using a very low spam score threshold: a score of just 1. This (significantly) increases the chances of a message falsely getting marked as spam. At minimum I'd suggest using a threshold score of 3, but a score between 5 and 7 is recommended. Instead of lowering the score threshold to catch more spam, consider tweaking SpamAssassin to better recognize spam.

I don't know exactly which blacklists and additional spam tests the paid version of Plesk Email Security provides. But it can usually be beneficial to tweak SpamAssassin, for example with additional rule sets (the KAM ruleset is quite popular) and plugins (such as DCC and Spamhaus DQS).
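
To illustrate why a threshold of 1 is risky, here is a small Python sketch. The scores in `legit_scores` are invented for the example and do not come from your server; the only assumption about SpamAssassin is that a message counts as spam when its score reaches the threshold.

```python
def is_spam(score, required_score):
    """A message is tagged as spam when its score reaches the threshold."""
    return score >= required_score


# Hypothetical scores of legitimate messages (illustrative values only).
legit_scores = [-2.1, 0.4, 1.2, 2.8, 3.5]


def false_positives(threshold):
    """Count legitimate messages that would be tagged as spam."""
    return sum(is_spam(score, threshold) for score in legit_scores)


for threshold in (1.0, 3.0, 5.0):
    print(threshold, false_positives(threshold))
# 1.0 3
# 3.0 1
# 5.0 0
```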

My recommendations:

Don't rely too much on Bayes learning alone to detect spam.
Use a higher spam score threshold.
- Instead, try to tweak your SpamAssassin install to better recognize spam.
Customize your SpamAssassin configuration with a neutral (zero) spam score for the BAYES_00 test and a lower bayes_auto_learn_threshold_nonspam value (for example -2.0). That way you're less likely to have messages getting caught in such a feedback loop. To do this, create a file in the /etc/spamassassin directory with a .cf file extension, for example custom_score.cf, and add:
Code:
```
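# Only auto learn as ham when the score is below -2.0 (default: 0.1)
bayes_auto_learn_threshold_nonspam -2.0
# Neutralize BAYES_00 (a score of 0 disables the rule)
score BAYES_00 0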
```
Open a ticket with Plesk support if there are more messages that have a high spam score but aren't learned from. This could indicate that the learning feature isn't working properly.

If you need or would like more tools to configure and tweak SpamAssassin from within Plesk (instead of via the command line), have a look at the Warden Anti-spam extension. It contains much more elaborate options to configure SpamAssassin.
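
To show what the suggested BAYES_00 and bayes_auto_learn_threshold_nonspam changes would do to the first example message (score -2.083), here is a rough Python sketch. It uses the simplified reading from this thread, where the total score is compared directly against the ham threshold; SpamAssassin's real autolearn decision is more involved, and the helper names are invented.

```python
# Scores of the first example message, copied from its X-Spam-Status header.
FIRST_MESSAGE = {
    "BAYES_00": -1.9,
    "DKIM_SIGNED": 0.1,
    "DKIM_VALID": -0.1,
    "DKIM_VALID_AU": -0.1,
    "DKIM_VALID_EF": -0.1,
    "HTML_FONT_LOW_CONTRAST": 0.001,
    "HTML_MESSAGE": 0.001,
    "RCVD_IN_DNSWL_BLOCKED": 0.001,
    "RCVD_IN_VALIDITY_CERTIFIED_BLOCKED": 0.001,
    "RCVD_IN_VALIDITY_RPBL_BLOCKED": 0.001,
    "RCVD_IN_VALIDITY_SAFE_BLOCKED": 0.001,
    "T_REMOTE_IMAGE": 0.01,
    "URIBL_BLOCKED": 0.001,
}


def total(scores, overrides=None):
    """Sum the scores, optionally replacing some rule scores first."""
    merged = {**scores, **(overrides or {})}
    return round(sum(merged.values()), 3)


def learned_as_ham(score, nonspam_threshold):
    """Simplified model: auto learned as ham when below the threshold."""
    return score < nonspam_threshold


before = total(FIRST_MESSAGE)                    # current scores
after = total(FIRST_MESSAGE, {"BAYES_00": 0.0})  # with BAYES_00 neutralized

print(before, learned_as_ham(before, 0.1))   # -2.083 True
print(before, learned_as_ham(before, -2.0))  # -2.083 True
print(after, learned_as_ham(after, -2.0))    # -0.183 False
```

In this model, lowering the threshold alone would not be enough for this message; it is the combination with a neutral BAYES_00 score that keeps it from being learned from as ham.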

Hope this helps a bit.

AvaDev · Feb 20, 2025

FYI, SpamAssassin training is available only in the paid version of the extension. See: https://support.plesk.com/hc/en-us/...-How-does-SpamAssassin-training-work-in-Plesk

Kaspar · Feb 20, 2025

AvaDev said:
FYI, SpamAssassin training is available only in the paid version of the extension. [..]

Yes, the paid version is the one being discussed here.
