TUNING BOGOFILTER'S ROBINSON-FISHER METHOD -- an updated HOWTO

(Greg Louis, September 2004)

NB: Bogotune is a tool (shipped with bogofilter) that automates the tuning process. Its "full search" mode performs a five-dimensional grid search over possible values of the parameters to be described below, and comes up with recommendations for optimal settings. There's also a "partial search" mode that is only three-dimensional. If you have enough spam and nonspam messages (at least 2,500 of each), using bogotune is highly recommended for optimizing bogofilter's accuracy.

CONTENTS

  INTRODUCTION
  ROBINSON'S x
  ROBINSON'S s
  THE MINIMUM DEVIATION
  EFFECTIVE SIZE FACTORS (ESF)
  THE SPAM AND NONSPAM CUTOFFS
  AN OVERVIEW OF BOGOTUNE
  HOW OFTEN TO TUNE
  A NOTE ON TRAINING

INTRODUCTION

The bogofilter program has evolved through three classification methods: the original as proposed by Paul Graham and implemented in bogofilter by Eric S. Raymond; a variation proposed by Gary Robinson and implemented in bogofilter by Greg Louis; and a further variation, also proposed by Gary Robinson, which uses Fisher's method (Fisher, R. A., 1950: Statistical Methods for Research Workers, pp. 99ff.  London: Oliver and Boyd) of combining probabilities; for bogofilter, your author has implemented this one too.  Recently, Gary Robinson described a further improvement, the application of effective size factors in the scoring process; this is available by default in bogofilter, but since it's relatively new, an option exists in both bogofilter and bogotune to switch it off.

Each of Gary Robinson's proposed classification methods works better than the earlier versions.  For optimal results, they (and the original) require some tuning.  As distributed, bogofilter attempts to supply good starting values for the tunable parameters.  Since the optimum values depend on the size and content of the wordlists at your installation, the best results can only be determined by some experimentation using your wordlists.  After several thousand each of spam and nonspam messages have been fed to bogofilter for training, this experimentation can be well worthwhile: you may be able to cut the number of spams that are still getting through by more than half.

The purpose of this document is to explain how to adjust bogofilter's parameters for best spam filtering.  A manual tuning process is described; though you'll be wise to use bogotune to help find your parameter values, understanding the manual process will help you make sense of what bogotune does.

With Robinson's changes as implemented in bogofilter, there are seven (five without effective size factors) things to tune, six (or four) of which are highly interdependent, as explained below:

ROBINSON'S x

First off, you should determine the value of x appropriate to your training database.

The way bogofilter works, summarizing briefly, is that the message being classified is separated into "tokens" -- words, IP addresses and other logical units of information.  Each token is looked up in the wordlist that makes up the training database.  The number of times it's been seen in a spam message is divided by the total number of times it's been seen, and that gives an indication of the likelihood that the token is in a spam message.  The likelihood estimates for all the tokens are combined to give a score between 0 and 1 -- 0 means the message is not likely to be a spam, 1 means it is.  That's fine, but what happens if a token's never been seen before? That's where x comes in: it's a "first guess" at what the presence of an unknown token means, in terms of its contribution to the score.  It's the value used as the likelihood estimate when a new token is found.

Obtaining a value for x is easy: bogoutil will calculate it for you.

Assuming your bogofilter wordlist is in ~/.bogofilter, run

  bogoutil -r ~/.bogofilter

This will print out an x value. If it's in the range of 0.4 to 0.6, you can run  bogoutil -R ~/.bogofilter  to install the calculated value so bogofilter will use it from then on.

The value of x that bogoutil calculates is just the average of

  p(w) = badcount / (goodcount + badcount)

for every word in your training set that appears at least 10 times in spam and/or ham counts in your wordlist (i.e. badcount + goodcount >= 10).

To be honest, that's an oversimplification, for the sake of explaining the basic concept clearly.  In real life, you have to scale the counts somehow.  If you had exactly the same number of spam and nonspam messages contributing to your wordlist, the formula for p(w) given above would be ok; but we actually have to use

  p(w) = (badcount / badlist_msgcount) /
         (badcount / badlist_msgcount + goodcount / goodlist_msgcount)

where badlist_msgcount is the number of spam messages that were fed into the training set, and goodlist_msgcount is the number of nonspams used in training.

An equivalent way of calculating x that's a little easier to read is:

  scalefactor = badlist_msgcount / goodlist_msgcount
  p(w) = badcount / (badcount + goodcount * scalefactor)

In either case, x is the average of those p(w) values.
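
For concreteness, here is a minimal Python sketch of that calculation (not bogoutil's actual code).  The counts dict, mapping each token to a (goodcount, badcount) pair, and the two message-count arguments are hypothetical stand-ins for what bogoutil reads from the wordlist:

  def robinson_x(counts, goodlist_msgcount, badlist_msgcount):
      # Average the scaled p(w) over tokens with goodcount + badcount >= 10.
      scalefactor = badlist_msgcount / goodlist_msgcount
      pws = [bad / (bad + good * scalefactor)
             for good, bad in counts.values()
             if good + bad >= 10]
      return sum(pws) / len(pws)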

The calculated x is just a first guess, and it may be worthwhile experimenting (after tuning s and the minimum deviation as described below) to see what happens if you adjust it upward or downward by up to about 0.1.

ROBINSON'S s

With a count of zero for a given token, we have only x to go on, so that's what we use as the likelihood estimate for that token.  The question then arises, what if we have seen the token before, but only a few times? Statistical variation will result in the ratio of two small numbers being rather unreliable, so perhaps we ought to compromise between that ratio and our "first guess" value.  This is what the Robinson method does.

The compromise works like this: a parameter s is defined, that serves as a weighting factor; the larger the value of s, the more importance is given to x in the presence of low token counts.  The token count is  n = badcount + goodcount  and the p(w) value for the token is modified as follows:

  f(w) = (s * x + n * p(w))/(s + n)

As you see, if n is zero, f(w) is just x, as we have been saying all along; but if n is nonzero and small, the s * x term will influence the f(w) value.
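
In code, the compromise is a one-liner.  Here is an illustrative Python sketch, with neutral defaults standing in for whatever s and x you settle on:

  def f_w(pw, n, s=1.0, x=0.5):
      # Shrink p(w) toward x; the smaller the count n, the stronger the pull.
      return (s * x + n * pw) / (s + n)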

Parameters s and x are only important when the counts are small, and the value of s reflects what we think of as "small." If s is large, then when counts are small we trust our x value more than we do the p(w); if s is small, we give more weight to p(w) and less to x.

So how big should s be? Gary Robinson suggested, on a theoretical basis, that we start with a value of 1; it might be worth trying values in the range of 0.01 to 10, though I've never had good results with values larger than 1.  Making s smaller than 0.01 or so is a bad idea, because of what happens when a token is encountered that's been seen in spam before, but never in nonspam, or vice versa.  In that case, p(w) is exactly 1 or 0, and f(w) will vary greatly as the value of s diminishes.  As an example, one spam that had about ten such tokens, out of 78 that contributed to the spam score, scored 0.999 with s set to 0.001, and was found to score 0.505 when s was 1.0e-8!
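
To see the effect numerically, take a token seen ten times, but only in spam, so that p(w) is exactly 1.  Using the f_w sketch above with x = 0.5:

  f_w(pw=1.0, n=10, s=1.0)    # ~0.955
  f_w(pw=1.0, n=10, s=0.01)   # ~0.9995
  f_w(pw=1.0, n=10, s=1e-8)   # ~1.0000

As s shrinks, such one-sided tokens saturate at exactly 1 (or 0), which is what destabilizes the scores as described above.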

Choosing a value for s is a matter of trial and error.  Experience suggests that 0.1 might be a better starting value than 1, at least if your training database is moderately large.

THE MINIMUM DEVIATION

MIN_DEV is another thing you might need to vary.  Paul Graham's original method was based on looking only at the fifteen tokens with p(w) values farthest from 0.5 (nearest to 0 or 1).  We don't do it that way; instead, we set a minimum deviation from 0.5 (Gary Robinson coined the term "exclusion radius" for this parameter), and look at all the tokens with f(w) values farther away than that.  If the minimum is set to zero, every token in the message contributes its spammishness value f(w) to the final calculation.  It might save time, and perhaps improve discrimination, to ignore tokens with f(w) values near 0.5, since those tokens obviously make less difference to the outcome of the calculation.  It seems helpful, at least once the training database is a good size, to use a MIN_DEV value between 0.3 and 0.46.  You might try 0.35 initially; one experiment suggests that even 0.44 might be a good value, but that may not work for everyone.  In fact, some people are likely to find that quite a small value of MIN_DEV (around 0.05) works best.

Note that higher values of MIN_DEV accentuate the distortion caused by small s, because tokens appearing in only one of the spam and nonspam counts will (if present) be a higher proportion of the total number of tokens considered.
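
In terms of the f(w) values from the previous section, the filter amounts to something like this sketch (0.35 being the suggested starting value):

  def significant(fws, min_dev=0.35):
      # Keep only tokens whose f(w) lies outside the exclusion radius.
      return [fw for fw in fws if abs(fw - 0.5) >= min_dev]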

EFFECTIVE SIZE FACTORS (ESF)

Tokens tend to appear more than once in a given message.  If a spam contains the word "valium", it's likely to occur several times.  Bogofilter uses any given token only once in calculating the message score, no matter how many times it appears in the message; but since "spammy" and "non-spammy" tokens may occur with very different frequency in the populations of spam and nonspam messages respectively, it helps (as Gary Robinson pointed out) to take this difference in redundancy into account.  This is done as follows.  Without effective size factors, the score is calculated with the inverse chi-squared function prbx, thus:

P = prbx(-2 * sum(ln(1-f(w))), 2*N)
Q = prbx(-2 * sum(ln(f(w))), 2*N)
S = (1 + Q - P) / 2

and to apply ESF, we instead calculate:

P = prbx(-2 * ln(prod(1-f(w))^y), 2*N*y)
Q = prbx(-2 * ln(prod(f(w))^z), 2*N*z)
S = Q / (Q + P)

where y and z are the spam and nonspam ESF values, the prod function returns the product of all its arguments, and S is forced to 0.5 if Q and P are both very near zero.
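
As an illustration, here is a Python sketch of the ESF form of the calculation (with y = z = 1 the exponents drop out, leaving plain Fisher combining apart from the final formula for S).  scipy's chi2.sf stands in for prbx, and the EPSILON guard is an assumption of this sketch, not necessarily bogofilter's actual handling:

  from math import log
  from scipy.stats import chi2

  EPSILON = 1e-12  # guards against ln(0) and the case where Q + P ~ 0

  def esf_score(f, y=1.0, z=1.0):
      # f holds the f(w) values that survived the MIN_DEV filter.
      n = len(f)
      p = chi2.sf(-2 * y * sum(log(max(1 - fw, EPSILON)) for fw in f),
                  2 * n * y)
      q = chi2.sf(-2 * z * sum(log(max(fw, EPSILON)) for fw in f),
                  2 * n * z)
      if p + q < EPSILON:  # both vanishingly small: call it a draw
          return 0.5
      return q / (q + p)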

Determining the correct values to use for y and z is, like choosing the value for s, a matter of trial and error.  A significant corpus of messages is required, and users who don't have access to large collections of their spam and nonspam messages should probably opt to do without ESF.  Useful values to try seem to be in the range of 0.75 raised to powers between 1 and 20.

THE SPAM AND NONSPAM CUTOFFS

Most spam messages will have scores very close to 1 when the Fisher method of combining the likelihood estimates is applied, and most nonspam will have scores very close to 0.  In between, there is a grey area, where messages will fall that have both spammish and nonspammish characteristics.  The spam and nonspam cutoffs are thresholds: bogofilter classes messages with scores below the nonspam cutoff as nonspam, and those with scores at or above the spam cutoff as spam. Messages with scores between the two cutoff values are classed as uncertain.  (Usually, mail administrators will want to deliver mail classed as uncertain, even though some of it may well be spam.)

The best value for the spam cutoff depends strongly on the values of the x, s, MIN_DEV and ESF parameters that are being used in the calculation.  For that reason, the way to test a given parameter set is the following:

  1. Determine a suitable level of false positives (nonspams classified as spam); this will probably need to fall somewhere in the range of 0.05 to 0.3 percent (the lower the better, of course, except that lowering the false-positive target too much gives too many false negatives).
  2. Apply the parameters to classify a number of known nonspam messages. From the distribution of scores, pick a spam cutoff value that will give the selected proportion of false positives.
  3. Apply the parameters and the chosen spam cutoff to classify a number of spam messages; the parameter set should be judged on how few false negatives (spam messages classed as uncertain or nonspam) are obtained.

The value for the nonspam threshold should be such that no more than about one in ten thousand spams is classified as nonspam.  A value of 0.2 to 0.25 might be a good starting point for use with the recommended s and MIN_DEV (0.1 and 0.35 respectively).
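
A sketch of steps 2 and 3, together with the nonspam threshold just described, might look like the following; the score arrays are hypothetical, and in practice would come from running bogofilter over your test messages:

  import numpy as np

  def pick_cutoffs(nonspam_scores, spam_scores, fp_target=0.001):
      # Spam cutoff: all but fp_target of the nonspam scores fall below it.
      spam_cutoff = np.quantile(nonspam_scores, 1 - fp_target)
      # Judge the parameter set by how few spams score below the cutoff.
      false_negatives = int(np.sum(spam_scores < spam_cutoff))
      # Nonspam cutoff: no more than ~1 in 10,000 spams scores below it.
      nonspam_cutoff = np.quantile(spam_scores, 0.0001)
      return spam_cutoff, nonspam_cutoff, false_negatives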

AN OVERVIEW OF BOGOTUNE

As already mentioned, bogotune attempts to automate the above process by doing a grid search over useful ranges of s, MIN_DEV, x, y and z (an option exists to disable ESF, i.e. leave the ESF values [y and z] at 1.0).  Bogotune validates the test inputs, calculates a suitable cache size for the training database, calculates the starting x value as described above, and picks a false-positive target with which to run the grid search.  There is an option to force a specific target (more about this shortly).  If not coerced, the target is calculated based on the number of nonspam messages in the test set, and then adjusted downward to give a cutoff between 0.5 and 0.975, above 0.55 if possible.  It's important to note that this false-positive target is chosen to facilitate the grid search, and is usually very much larger than one would want to see in production; there's no relation between the two.  At the end of its run, bogotune attempts to suggest a reasonable production target.

With these preliminaries completed, bogotune performs two scans:  The first is a coarse grid search over the entire useful range of each parameter, which should locate an approximate optimum.  A finer grid, centered on the optimum derived from the coarse search, is then scanned to produce the final recommendations.  At the end of each scan, outliers -- apparently good parameter sets from which even slight deviation degrades performance significantly -- are rejected.  Bogotune usually manages to find a robust parameter set that gives good discrimination between spam and nonspam messages, provided that sufficient training and test messages are supplied.
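
The two scans can be caricatured in a few lines.  This toy sketch searches one dimension where bogotune searches five at once, and its step counts are invented:

  def two_stage_search(cost, lo, hi, steps=10):
      # Coarse scan over the whole useful range of the parameter.
      coarse = [lo + i * (hi - lo) / steps for i in range(steps + 1)]
      best = min(coarse, key=cost)
      # Finer scan centered on the coarse optimum.
      span = (hi - lo) / steps
      fine = [best - span + i * 2 * span / steps for i in range(steps + 1)]
      return min(fine, key=cost)

Here cost would be something like the false-negative count at the chosen false-positive target.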

There are two reasons why one might want to force a bogotune run to use a specific false-positive target.  One is that sometimes bogotune's target calculation algorithm is overly optimistic, i.e. it occasionally sets the target too low.  The symptom of this problem is that many parameter combinations in the coarse grid scan simply can't deliver that few false positives from the test message corpus.  Manually increasing the target by 20 to 30 percent usually fixes this.  The other reason to coerce the target to a specific value is that one might want to compare two bogotune runs -- with and without ESF, for example -- and comparisons aren't valid unless both runs use the same target.

Bogotune may produce very different recommendations for very similar sets of spam and nonspam messages.  That's not necessarily a defect.  For many message populations, the scoring algorithm doesn't depend heavily on precise values of any of the parameters but the spam cutoff.  In such cases it's common to see bogotune's coarse scan settle on one of several local optima that may have quite different parameter values.  In general, the parameter values have more influence when a single training database is used for a large number of users, and are less crucial when the wordlist is for just one or a few users.

HOW OFTEN TO TUNE

It's probably wise to review the spam cutoff frequently.  If the false-negative count is gratifyingly low, or if false positives are occurring, increasing the cutoff will reduce the false-positive rate. If, on the other hand, there are absolutely no false positives but the false-negative rate is high, lowering the cutoff a bit may improve discrimination.

How often to tune the other parameters depends on how fussy you are about optimizing performance.  If you're eager to get the best discrimination you can, here are my recommendations:  In the early stage, while the training database is still small and growing rapidly, it's probably wise to experiment with tuning x, s and MIN_DEV once a month or so.  Once the training database is a good size (over 5000 spams and 5000 nonspams), this can be reduced to quarterly or half-yearly.  If you use ESF (which I recommend you do), retuning should probably happen a bit more often, especially if you see that the characteristics of spam you're receiving seem to be changing.

FWIW my own practice, after two years' experience, is to review the spam cutoff monthly and do a bogotune run about quarterly.

A NOTE ON TRAINING

Bogofilter's ability to distinguish accurately between spam and nonspam messages depends on the quality of its training database.  Here is a way of maximizing that quality with relatively little effort.

When starting afresh, feed every spam and every nonspam you get into bogofilter.  Do not use bogofilter's -u option to do this: there will be far too many errors when your training database is small.  Instead, classify messages manually and train bogofilter with the -n and -s options appropriately.  You can do it in batches: if you work with standard mbox files, use a mail reader to move spam and nonspam into separate files, then do  bogofilter -s < spambox  and  bogofilter -n < nonspambox to register the messages.

To find out how many spam and nonspam messages have gone into your wordlist, assuming it's kept in ~/.bogofilter, do

  bogoutil -w ~/.bogofilter .MSG_COUNT

Once you've accumulated about 5,000 spam or nonspam messages in the list, you need to let the other count catch up unless they're growing at about the same rate.  Stop adding every message to the larger of the two counts, and instead, add only messages that bogofilter got wrong or was unsure about.  To do this, you need to start classifying messages into 4 sets instead of 2: spam, nonspam, unsures that were actually spam, and unsures that were nonspam.

Once the counts are both over 5,000 and fairly similar, you can train only on errors and unsures.  By this time there should be very few errors (spams classed as nonspam or vice versa), but there will still be a proportion of unsure spam and unsure nonspam messages. Training on these will keep bogofilter working well, as you're telling it what it needs to learn, and not so much of what it already knows. If one list grows faster than the other, extra (correctly classified) messages may be added from time to time to equalize them again; try to keep the smaller list's message count at least two thirds of the larger's.

The use of bogofilter's -u option is convenient but dangerous.  Even when well trained, bogofilter will misclassify a small number of messages.  With the -u option, these mistakes are entered into the database and decrease its accuracy.  More mistakes result and the process snowballs.  You therefore need to review and correct the databases frequently, and in the interval between such reviews, bogofilter's effectiveness will keep falling off.  The method described above has the advantage that (human error excepted) no wrongly classified messages are ever entered into the database, and if you have to leave it for a time without updating it, its effectiveness doesn't diminish.


Much of the advice given here arises out of experiments reported on the author's bogofilter web pages; in particular, the report at http://www.bgl.nu/bogofilter/smindev3.html might interest those who'd like more information about the basis of bogofilter tuning.

Thanks go to David Relson for reviewing this document in draft and suggesting several improvements.

[Greg Louis, 2004; last modified 2004-09-09]