Sorry for taking a bit longer - kids sick, now me sick..
Hope you get well soon.
It does show the message.
That's odd, I would have expected it to return
learned tokens from 0 message(s)
, as that would mean the message had already been learned from previously. It looks like the message was recognized as spam (its spam score is higher than the spam score threshold), but somehow it was not learned from as spam.
With SpamAssassin, a message can be learned from in two different ways:
- The first is directly upon receiving an email. This is called auto learning. A message is only auto learned from if the spam score of the message hits the autolearn threshold. For spam the minimum default threshold score is 12 and for ham the default maximum threshold is 0.1. Anything below 0.1 is considered to be ham. These score thresholds can be configured in SpamAssassin with the
bayes_auto_learn_threshold_nonspam
and bayes_auto_learn_threshold_spam
settings (more info here).
- The second method used for learning spam is from messages in the spam folder. A cron job is (generally) run once a day to learn from all new messages in the spam folder. (Every message is only learned from once).
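As a rough illustration of the auto learning rule described above, here is a minimal Python sketch. The function name autolearn_decision is made up for this example, and it only models the two thresholds with the default values mentioned in this post; the real SpamAssassin logic applies further conditions, so treat this as a simplification.

```python
def autolearn_decision(score, nonspam_threshold=0.1, spam_threshold=12.0):
    """Return 'ham', 'spam' or 'no' based on the two thresholds only.

    Hypothetical helper: the defaults mirror the values named in this post
    for bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam.
    """
    if score < nonspam_threshold:
        return "ham"   # low enough to be auto learned from as ham
    if score >= spam_threshold:
        return "spam"  # high enough to be auto learned from as spam
    return "no"        # in between: not auto learned from at all


for score in (-2.0, 5.0, 12.0):
    print(score, autolearn_decision(score))
```

Anything that lands between the two thresholds is simply left alone by auto learning and has to be picked up by the spam folder cron job instead.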
I don't have access to the paid version of Plesk Email Security, so I am not sure exactly which method is used to learn from spam messages. I assume there is a (nightly) cron job running, in addition to the auto learning method.
Let's look at the headers of your email examples to get a better picture of why these messages are marked as spam and ham.
This message has a spam score of -2.083. Because this is below the default
bayes_auto_learn_threshold_nonspam
value of 0.1, this message is auto learned from as ham. Let's have a look at the spam test results to see why it got a negative spam score. In the
X-Spam-Status
header of the email, all tests that returned a score are listed together with their scores:
- BAYES_00=-1.9,
- DKIM_SIGNED=0.1,
- DKIM_VALID=-0.1,
- DKIM_VALID_AU=-0.1,
- DKIM_VALID_EF=-0.1,
- HTML_FONT_LOW_CONTRAST=0.001,
- HTML_MESSAGE=0.001,
- RCVD_IN_DNSWL_BLOCKED=0.001,
- RCVD_IN_VALIDITY_CERTIFIED_BLOCKED=0.001,
- RCVD_IN_VALIDITY_RPBL_BLOCKED=0.001,
- RCVD_IN_VALIDITY_SAFE_BLOCKED=0.001,
- T_REMOTE_IMAGE=0.01,
- URIBL_BLOCKED=0.001
Except for the first test (
BAYES_00
) the test scores are pretty insignificant. The
BAYES_00
test means that the spam probability is 0 to 1% based on the Bayes learning database. In other words, too few similar messages have been found (and learned from as spam). Since none of the other tests recognizes the message as spam, the message gets a negative score. It's a bit of a negative feedback loop: because of the negative spam score, the message gets auto learned from as ham again (indicated by the
autolearn=ham
marker).
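To double-check how the total comes about, the test scores can simply be added up. Below is a small Python sketch that does this for the list above. The helper name parse_tests is made up for this example, and the input string is just the test list from this post typed out by hand, not the literal header.

```python
def parse_tests(tests):
    """Turn 'NAME=score, NAME=score, ...' into a dict of name -> float."""
    scores = {}
    for item in tests.split(","):
        item = item.strip()
        if not item:
            continue  # skip empty pieces, e.g. after a trailing comma
        name, _, value = item.partition("=")
        scores[name.strip()] = float(value)
    return scores


# Test results of the message that was auto learned from as ham.
HAM_TESTS = (
    "BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, "
    "DKIM_VALID_EF=-0.1, HTML_FONT_LOW_CONTRAST=0.001, HTML_MESSAGE=0.001, "
    "RCVD_IN_DNSWL_BLOCKED=0.001, RCVD_IN_VALIDITY_CERTIFIED_BLOCKED=0.001, "
    "RCVD_IN_VALIDITY_RPBL_BLOCKED=0.001, RCVD_IN_VALIDITY_SAFE_BLOCKED=0.001, "
    "T_REMOTE_IMAGE=0.01, URIBL_BLOCKED=0.001"
)

scores = parse_tests(HAM_TESTS)
total = round(sum(scores.values()), 3)
biggest = max(scores, key=lambda name: abs(scores[name]))

print(total)    # -2.083
print(biggest)  # BAYES_00
```

The sum matches the reported spam score of -2.083, and BAYES_00 is by far the largest contributor.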
For comparison, let's look at the spam headers of the other message, the one that did get recognized as spam. This message has a spam score of 15.14, which it got from these spam tests:
- BAYES_00=-1.9,
- DKIM_ADSP_NXDOMAIN=0.9,
- DOS_BODY_HIGH_NO_MID=3.999,
- HTML_FONT_LOW_CONTRAST=0.001,
- HTML_IMAGE_ONLY_20=1.546,
- HTML_MESSAGE=0.001,
- HTML_MIME_NO_HTML_TAG=0.377,
- MIME_HTML_ONLY=0.1,
- MIME_HTML_ONLY_MULTI=0.001,
- MISSING_DATE=1.36,
- MISSING_MID=0.497,
- MPART_ALT_DIFF=0.79,
- NORDNS_LOW_CONTRAST=0.001,
- RCVD_IN_DNSWL_BLOCKED=0.001,
- RCVD_IN_PBL=3.335,
- RCVD_IN_SBL_CSS=3.335,
- RCVD_IN_VALIDITY_CERTIFIED_BLOCKED=0.001,
- RCVD_IN_VALIDITY_RPBL_BLOCKED=0.001,
- RCVD_IN_VALIDITY_SAFE_BLOCKED=0.001,
- RDNS_NONE=0.793,
- URIBL_BLOCKED=0.001
You'll notice right away that this list is much longer. Compared to the other email, many more tests returned a score for this message. Several of them even returned a pretty high score, adding up to a high total spam score. Now, the spam score of 15.14 is higher than the default minimum threshold score of 12 for auto learning, yet the autolearn marker (
autolearn=no
) indicates the message did not get auto learned from. Not sure why, maybe your
bayes_auto_learn_threshold_spam
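setting uses a higher threshold value. It could also be that SpamAssassin's extra autolearn conditions were not met: besides the total score, it requires at least 3 points from header tests and 3 points from body tests before auto learning a message as spam.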
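To make the comparison with the autolearn threshold concrete, here is a short Python sketch that ranks the tests of this message by score. The scores are copied from the list above; the variable names and the cut-off of 1.0 point for a "heavy" test are my own choices for this example, not anything SpamAssassin defines.

```python
# Test results of the message that was recognized as spam.
SPAM_TESTS = {
    "BAYES_00": -1.9, "DKIM_ADSP_NXDOMAIN": 0.9, "DOS_BODY_HIGH_NO_MID": 3.999,
    "HTML_FONT_LOW_CONTRAST": 0.001, "HTML_IMAGE_ONLY_20": 1.546,
    "HTML_MESSAGE": 0.001, "HTML_MIME_NO_HTML_TAG": 0.377, "MIME_HTML_ONLY": 0.1,
    "MIME_HTML_ONLY_MULTI": 0.001, "MISSING_DATE": 1.36, "MISSING_MID": 0.497,
    "MPART_ALT_DIFF": 0.79, "NORDNS_LOW_CONTRAST": 0.001,
    "RCVD_IN_DNSWL_BLOCKED": 0.001, "RCVD_IN_PBL": 3.335, "RCVD_IN_SBL_CSS": 3.335,
    "RCVD_IN_VALIDITY_CERTIFIED_BLOCKED": 0.001,
    "RCVD_IN_VALIDITY_RPBL_BLOCKED": 0.001,
    "RCVD_IN_VALIDITY_SAFE_BLOCKED": 0.001, "RDNS_NONE": 0.793,
    "URIBL_BLOCKED": 0.001,
}
SPAM_AUTOLEARN_THRESHOLD = 12.0  # default value mentioned in this post

total = round(sum(SPAM_TESTS.values()), 3)

# Tests that contribute at least one full point, highest score first.
heavy = sorted(
    ((name, score) for name, score in SPAM_TESTS.items() if score >= 1.0),
    key=lambda pair: pair[1],
    reverse=True,
)
heavy_total = round(sum(score for _, score in heavy), 3)

print(total, total >= SPAM_AUTOLEARN_THRESHOLD)  # 15.141 True
for name, score in heavy:
    print(f"{name:22} {score}")
print(heavy_total)  # 13.575
```

Five tests alone already add up to more than 12 points, even though BAYES_00 pulls the total down by 1.9.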
It's generally not a (big) issue if a spam message did not get auto learned from. As long as it is stored in the spam folder of a mailbox, it should get learned from anyway when the (nightly) cron job runs. However, like I said at the beginning of my post, it's odd that this particular message still got learned from when you manually ran the sa-learn utility.
So maybe spam learning is indeed not working as expected, but it is hard to say for sure. Since you bought the paid version, you are eligible for support on this extension and could open a support ticket to have a support engineer analyze your server.
A few general remarks concerning spam filtering I'd like to share:
SpamAssassin evaluates email messages based on certain characteristics with a variety of "tests". Each test can attribute a score to the message, called the spam score. If the spam score of a message hits the spam threshold, the message is considered to be spam. A spam score can be a negative value too, in which case the message (often) is considered to be ham.
One of those many tests is the Bayes learning database. However, it is just one of many tests performed to determine whether a message is spam. All these different tests evaluate an email on spammy (or hammy) characteristics. The more tests evaluate a message to be spam, the higher the likelihood of the message actually being spam. Relying on just one specific test to determine whether a message is spam is not really an advisable practice.
As an analogy, imagine someone robbed a bank. There were many people at the bank witnessing the robbery. In the ensuing chaos the bank robber got away, but the police soon apprehended several suspects near the bank. The witnesses are asked to identify the bank robber from the suspects. With the first suspect, only one out of (for example) 10 witnesses identifies the suspect as the actual bank robber. That is a very low score, meaning that the first suspect likely isn't the actual bank robber. If the second suspect is identified by 8 out of 10 witnesses, the likelihood of that suspect being the bank robber is much higher. The same goes for spam: the more "tests" return a spam score, the higher the likelihood of a message actually being spam. (Maybe not the best analogy, but it hopefully gets the point across.)
I see you're using a very low spam score threshold: a score of just 1. This (significantly) increases the chances of a message falsely getting marked as spam. At a minimum I'd suggest using a threshold score of 3, but a score between 5 and 7 is recommended. Instead of lowering the score threshold to catch more spam, consider tweaking SpamAssassin to better recognize spam.
I don't know exactly which blacklists and additional spam tests the paid version of Plesk Email Security provides. But it can usually be beneficial to tweak SpamAssassin, for example with additional rule sets (the
KAM ruleset is quite popular) and plugins (for example DCC and
Spamhaus DQS).
My recommendations:
- Don't rely too much on only Bayes learning to detect spam.
- Use a higher spam score threshold.
- Instead, try to tweak your SpamAssassin install to better recognize spam.
- Customize your SpamAssassin configuration with a neutral (zero) spam score for the
BAYES_00
test and a lower bayes_auto_learn_threshold_nonspam
threshold value (for example -2.0). That way you're less likely to have messages getting caught in a negative feedback loop. To do this, create a file in the /etc/spamassassin
directory with a .cf file extension, for example custom_score.cf
, and add:
Code:
bayes_auto_learn_threshold_nonspam -2.0
score BAYES_00 0
- Open a ticket with Plesk support if there are more messages that have a high spam score but aren't learned from. This could indicate the learning feature isn't working properly.
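To show what the custom BAYES_00 score and the lower ham threshold from the recommendation above would mean for your first example message, here is a simplified Python sketch. It assumes auto learning as ham simply compares the sum of the listed test scores with the threshold, as described earlier in this post; the helper name total_score is made up, and the real SpamAssassin calculation may differ in its details.

```python
# Test results of the message that was auto learned from as ham.
HAM_TESTS = {
    "BAYES_00": -1.9, "DKIM_SIGNED": 0.1, "DKIM_VALID": -0.1,
    "DKIM_VALID_AU": -0.1, "DKIM_VALID_EF": -0.1,
    "HTML_FONT_LOW_CONTRAST": 0.001, "HTML_MESSAGE": 0.001,
    "RCVD_IN_DNSWL_BLOCKED": 0.001, "RCVD_IN_VALIDITY_CERTIFIED_BLOCKED": 0.001,
    "RCVD_IN_VALIDITY_RPBL_BLOCKED": 0.001,
    "RCVD_IN_VALIDITY_SAFE_BLOCKED": 0.001, "T_REMOTE_IMAGE": 0.01,
    "URIBL_BLOCKED": 0.001,
}


def total_score(tests, overrides=None):
    """Sum the test scores, replacing any score listed in overrides."""
    overrides = overrides or {}
    return round(sum(overrides.get(name, s) for name, s in tests.items()), 3)


default_total = total_score(HAM_TESTS)                    # -2.083
custom_total = total_score(HAM_TESTS, {"BAYES_00": 0.0})  # -0.183

# Default setup: score is below 0.1, so the message is auto learned as ham.
print(default_total, default_total < 0.1)
# Zero BAYES_00 score and threshold -2.0: no longer auto learned as ham.
print(custom_total, custom_total < -2.0)
# Zero BAYES_00 score alone would not be enough with the 0.1 threshold.
print(custom_total, custom_total < 0.1)
```

In this simplified model both changes are needed together: with BAYES_00 at zero the message still scores -0.183, which is below the default ham threshold of 0.1 but above the suggested -2.0.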
If you need/like more tools to configure and tweak SpamAssassin from within Plesk (instead of via the command line), have a look at the
Warden Anti-spam extension. It contains much more elaborate options to configure SpamAssassin.
Hope this helps a bit.