Training SpamAssassin’s accuracy with sa-learn

Posted on July 9, 2021 By Jonathan K. W.

Category: cPanel
Tags: #cpanel #email #sa-learn #spamassassin

This guide shows how to use SpamAssassin’s sa-learn effectively to train its tokens and improve spam-flagging accuracy.

When using sa-learn, you must be very diligent and persistent in classifying both the “HAM” and “SPAM” emails for any accounts with sa-learn enabled!

What is ‘sa-learn’?

Given a typical selection of your incoming mail classified as spam or ham (non-spam), this tool will feed each mail to SpamAssassin, allowing it to ‘learn’ what signs are likely to mean spam, and which are likely to mean ham.

Simply run this command once for each of your mail folders, and it will learn from the mail therein. Note that csh-style globbing in the mail folder names is supported; in other words, listing a folder name as * will scan every folder that matches. See Mail::SpamAssassin::ArchiveIterator for more details.

SpamAssassin remembers which mail messages it has learnt already, and will not re-learn those messages again, unless you use the --forget option. Messages learnt as spam will have SpamAssassin markup removed, on the fly.

If you make a mistake and scan a mail as ham when it is spam, or vice versa, simply rerun this command with the correct classification, and the mistake will be corrected. SpamAssassin will automatically ‘forget’ the previous indications.

Users of spamd who wish to perform training remotely, over a network, should investigate the spamc -L switch.

Official sa-learn documentation can be found here: sa-learn doc

Getting started

This will only work for email accounts using IMAP.

Create the spam/ham folders

In your preferred mail client, open the email account you are configuring this for.
Inside of your Inbox create at least two new folders/directories. One for the untrained spam, and one (or more) for your ham (not spam). For this article, we’ll be using “Junk” as our untrained spam folder, and “Seen” as our ham folder.
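
Once the folders exist in your mail client, it can be reassuring to confirm that they also exist on the server and to see how much mail they hold. The sketch below assumes the Maildir layout used by the commands later in this article (a folder such as /home/USER/mail/DOMAIN.TLD/ACCOUNT/.Junk containing cur and new subdirectories); count_maildir is a hypothetical helper name, not part of SpamAssassin or cPanel.

```shell
#!/bin/sh
# count_maildir MAILDIR
# Print how many message files sit in MAILDIR/cur and MAILDIR/new.
# (Hypothetical helper; the folder layout is assumed from this article's paths.)
count_maildir() {
    maildir=$1
    if [ ! -d "$maildir/cur" ] || [ ! -d "$maildir/new" ]; then
        echo "not a Maildir folder: $maildir" >&2
        return 1
    fi
    # Each message is one regular file; count them across both subdirectories.
    find "$maildir/cur" "$maildir/new" -type f | wc -l | tr -d ' '
}

# Example call (replace the placeholders first):
# count_maildir /home/USER/mail/DOMAIN.TLD/ACCOUNT/.Junk
```

If the function reports that the path is not a Maildir folder, the folder name on disk probably differs from the one shown in your mail client.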

Being diligent and pro-active

Make this part of your routine for checking email. Once you have read a new message, move it to one of the two folders: good mail goes to “Seen”, and bad mail/spam that SpamAssassin didn’t already catch goes to “Junk”.
This is the most difficult part of training properly, but it provides the best results. It takes some time, but the more tokens SpamAssassin is able to collect, the more accurate it becomes!

Running the sa-learn command with SpamAssassin

The sa-learn commands must be run as the mail account’s owner to work properly. If you are logged in as ‘root’, use the following syntax to run them as that user, replacing USER with the account owner’s username and ARGUMENTS with the sa-learn options shown in the steps below:

sudo -H -u USER bash -c '/usr/local/cpanel/3rdparty/bin/sa-learn ARGUMENTS'

From the command line run sa-learn on your email account’s “Junk” folder.
- sa-learn -p /home/USER/.spamassassin/user_prefs --spam /home/USER/mail/DOMAIN.TLD/ACCOUNT/.Junk/{cur,new}
  - Be sure of the path to sa-learn – on WHM v76 servers, it is /usr/local/cpanel/3rdparty/bin/sa-learn
- Depending on how many messages you have (and if you’ve run it before or not) you’ll see results similar to this: Learned tokens from 214 message(s) (1009 message(s) examined)
From the command line run sa-learn on your email account’s “Seen” folder. Note the differences here: “--ham” and “.Seen”.
- sa-learn -p /home/USER/.spamassassin/user_prefs --ham /home/USER/mail/DOMAIN.TLD/ACCOUNT/.Seen/{cur,new}
- You’ll see similar results for this, depending on how many messages are in the folder.
Checking learned tokens.
- Run: sa-learn --dump magic

  # sa-learn --dump magic
  0.000 0 3 0 non-token data: bayes db version
  0.000 0 1242 0 non-token data: nspam
  0.000 0 3872 0 non-token data: nham
  0.000 0 155784 0 non-token data: ntokens
  0.000 0 1404770116 0 non-token data: oldest atime
  0.000 0 1411483933 0 non-token data: newest atime
  0.000 0 0 0 non-token data: last journal sync atime
  0.000 0 1410340539 0 non-token data: last expiry atime
  0.000 0 5529600 0 non-token data: last expire atime delta
  0.000 0 137169 0 non-token data: last expire reduction count

nspam – Number of spam messages learned.
nham – Number of ham (non-spam) messages learned.
ntokens – Number of tokens stored in the database.
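
If you only care about those three counters, the dump can be boiled down to a single line. This is a minimal sketch that assumes the column layout shown in the sample output above (the value in the third column, the counter name as the last word); bayes_summary is a hypothetical function name.

```shell
#!/bin/sh
# bayes_summary: read `sa-learn --dump magic` output on stdin and print
# the three counters on one line. Assumes the column layout shown above.
bayes_summary() {
    awk '
        $NF == "nspam"   { spam = $3 }
        $NF == "nham"    { ham = $3 }
        $NF == "ntokens" { tokens = $3 }
        END { printf "nspam=%d nham=%d ntokens=%d\n", spam, ham, tokens }
    '
}

# Usage (run as the mail account owner):
#   sa-learn --dump magic | bayes_summary
```

For the sample output above this prints nspam=1242 nham=3872 ntokens=155784. Keep an eye on both message counts while you train: with SpamAssassin’s default settings, Bayes scoring is not applied until at least 200 spam and 200 ham messages have been learned.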

You can run these commands manually whenever you like, but doing so regularly quickly becomes a chore, so many people schedule them with cron jobs instead. I’d recommend running the cron jobs no more than once per day, during off-peak hours.
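
As a rough sketch of what such a cron job could look like, the script below wraps the two sa-learn commands from the steps above. The script name train-bayes.sh, its location under /home/USER/bin, and the 03:30 schedule are assumptions made for illustration; USER, DOMAIN.TLD and ACCOUNT are the same placeholders used earlier. The cur and new directories are listed separately because cron usually runs jobs with /bin/sh, which may not expand {cur,new}.

```shell
#!/bin/sh
# train-bayes.sh (hypothetical name): learn spam from .Junk and ham from .Seen.
# Usage:
#   train-bayes.sh /home/USER/.spamassassin/user_prefs /home/USER/mail/DOMAIN.TLD/ACCOUNT
#
# Example entry for the mail account owner's crontab (daily at 03:30),
# written on a single line:
#   30 3 * * * /home/USER/bin/train-bayes.sh /home/USER/.spamassassin/user_prefs /home/USER/mail/DOMAIN.TLD/ACCOUNT >/dev/null 2>&1

# Path to sa-learn as given in this article; override it if yours differs.
SA_LEARN="${SA_LEARN:-/usr/local/cpanel/3rdparty/bin/sa-learn}"

# learn TYPE PREFS FOLDER, where TYPE is "spam" or "ham".
learn() {
    # cur/ and new/ are passed explicitly instead of relying on {cur,new}.
    "$SA_LEARN" -p "$2" "--$1" "$3/cur" "$3/new"
}

# train_account PREFS ACCOUNT_DIR: train both folders, spam first.
train_account() {
    learn spam "$1" "$2/.Junk" && learn ham "$1" "$2/.Seen"
}

# Only train when both arguments are supplied.
if [ "$#" -eq 2 ]; then
    train_account "$1" "$2"
fi
```

Adding the entry with crontab -e while logged in as the account owner means the job runs as that user, which sa-learn needs in order to work properly.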

Also, remember that training works both ways: if non-spam ends up in the spam folder, move it into “Seen” so SpamAssassin can learn from the false positive.

If you try this, feel free to comment and let us know your results!

KNOWNHOST KNOWLEDGE BASE