What You Need to Know About Bayesian Spam Filtering
How Bayesian Spam Filtering Works

By , About.com Guide

If a word, "Cartesian" for example, never appears in spam but often appears in your legitimate mail, the probability that "Cartesian" indicates spam is near zero. "Toner", on the other hand, appears exclusively, and often, in spam. The probability that a message containing "toner" is spam is therefore very high, not much below 1 (100%).
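To make the idea concrete, here is a minimal sketch of how such a per-word probability could be estimated from message counts. The function name, the sample counts, and the clamping to the range 0.01 to 0.99 are assumptions made for illustration (clamping is a common convention so that no single word counts as absolute proof); real filters differ in the details.

```python
def word_spam_probability(spam_with_word, total_spam, ham_with_word, total_ham):
    """Estimate the probability that a message containing a word is spam.

    Assumes spam and good mail are equally likely beforehand, and that
    both totals are positive. All names here are illustrative.
    """
    spam_rate = spam_with_word / total_spam  # share of spam containing the word
    ham_rate = ham_with_word / total_ham     # share of good mail containing it
    if spam_rate + ham_rate == 0:
        return 0.5  # word never seen before: no evidence either way
    p = spam_rate / (spam_rate + ham_rate)
    # Assumed convention: keep the estimate away from exactly 0 and 1.
    return min(0.99, max(0.01, p))


# Hypothetical counts: "Cartesian" in 40 of 1000 good mails and no spam,
# "toner" in 120 of 1000 spam messages and no good mail.
cartesian = word_spam_probability(0, 1000, 40, 1000)
toner = word_spam_probability(120, 1000, 0, 1000)
```

With these made-up counts, "Cartesian" ends up at the lower bound and "toner" at the upper bound of the clamped range.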

When a new message arrives, it is analyzed by the Bayesian spam filter, and the probability of the complete message being spam is calculated from the probabilities of its individual characteristics, such as the words it contains.

Let's say a message contains both "Cartesian" and "toner". From these two words alone it is not yet clear whether we have spam or legitimate mail. But other characteristics will (most probably) tip the overall probability far enough one way for the filter to classify the message as either spam or good mail.
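The following sketch shows one common way to combine per-word probabilities into a score for the whole message, treating the words as independent pieces of evidence. It is not necessarily what any particular filter does; the function name and all probability values are assumptions for illustration.

```python
import math


def combine(word_probs):
    """Combine per-word spam probabilities into one message score.

    Treats words as independent evidence (the "naive" Bayes assumption).
    Each probability must lie strictly between 0 and 1.
    """
    # Work with logarithms so that long messages do not underflow.
    log_spam = sum(math.log(p) for p in word_probs)
    log_ham = sum(math.log(1.0 - p) for p in word_probs)
    diff = log_ham - log_spam
    if diff > 700:  # exp() would overflow; the score is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


# "Cartesian" (assumed 0.01) and "toner" (assumed 0.99) cancel each other out.
two_words = combine([0.01, 0.99])

# Two more spam-like words (assumed 0.95 and 0.90) settle the question.
with_more = combine([0.01, 0.99, 0.95, 0.90])
```

In this example, `two_words` comes out very close to 0.5 (undecided), while `with_more` is well above 0.9, so the additional characteristics decide the classification.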

Bayesian Spam Filters Can Adapt Automatically

Now that we have a classification, the message can be used to train the filter further. Either the probability of "Cartesian" indicating good mail is lowered (if the message containing both "Cartesian" and "toner" is found to be spam), or the probability of "toner" indicating spam is lowered (if the message is found to be good mail).
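A minimal sketch of such training, assuming the filter simply keeps per-word message counts and recomputes probabilities from them. The class, its method names, and the sample messages are hypothetical.

```python
from collections import Counter


class WordStats:
    """Keeps per-word message counts for spam and good mail (illustrative)."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.spam_total = 0
        self.ham_total = 0

    def train(self, words, is_spam):
        # Count each word at most once per message.
        unique = {w.lower() for w in words}
        if is_spam:
            self.spam_counts.update(unique)
            self.spam_total += 1
        else:
            self.ham_counts.update(unique)
            self.ham_total += 1

    def spam_probability(self, word):
        word = word.lower()
        spam_rate = self.spam_counts[word] / self.spam_total if self.spam_total else 0.0
        ham_rate = self.ham_counts[word] / self.ham_total if self.ham_total else 0.0
        if spam_rate + ham_rate == 0:
            return 0.5  # unknown word
        return spam_rate / (spam_rate + ham_rate)


stats = WordStats()
for _ in range(9):
    stats.train(["Cartesian", "coordinates"], is_spam=False)
    stats.train(["toner", "cheap"], is_spam=True)

before = stats.spam_probability("Cartesian")

# The message containing both words is found to be spam and is fed back in.
stats.train(["Cartesian", "toner"], is_spam=True)
after = stats.spam_probability("Cartesian")
```

Here `after` is higher than `before`: having now been seen in one spam message, "Cartesian" is a slightly weaker indicator of good mail than it was.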

Using this auto-adaptive technique, Bayesian filters can learn from both their own decisions and the user's (when she manually corrects a misjudgment by the filter). This adaptability also makes Bayesian filters most effective for the individual email user. While most people's spam may have similar characteristics, legitimate mail is characteristically different for everybody.

How Can Spammers Get Past Bayesian Filters?

The characteristics of legitimate mail are just as important for the Bayesian spam filtering process as those of spam. If the filters are trained specifically for every user, spammers will have an even harder time working around everybody's (or even most people's) spam filters, and the filters can adapt to almost everything spammers try.

Spammers will only make it past well-trained Bayesian filters if they make their spam messages look exactly like the ordinary email everybody may get. They could do that today, too.

Spammers do not usually send such ordinary emails, I presume, because they don't work. So chances are they won't start sending them even when ordinary, boring emails are the only way to make it past the anti-spam filters.

If spammers do switch to mostly ordinary-looking emails, however, we will see a lot of spam in our Inboxes again, and email may become as frustrating as it was in pre-Bayesian days (or even worse). Such a switch will also have ruined the market for most kinds of spam, though, and thus won't last for long.

There is one conceivable way for spammers to get through Bayesian filters even with their usual content. It is in the nature of Bayesian statistics that one word that appears very frequently in good mail can be so significant that it turns a message that otherwise looks like spam into one the filter rates as ham (good mail).

If spammers find a way to determine your sure-fire good-mail words (by using HTML return receipts to see which messages you opened, for example), they can include one of them in a junk mail and reach you even through a well-trained Bayesian filter.
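A small numeric sketch of this effect, assuming the filter multiplies the odds contributed by each word and treats a score of 0.5 or more as spam. The function name, the threshold, and all probabilities are invented for illustration.

```python
def spam_score(word_probs):
    """Combine per-word spam probabilities by multiplying their odds.

    Each probability must lie strictly between 0 and 1.
    """
    odds = 1.0
    for p in word_probs:
        odds *= p / (1.0 - p)
    return odds / (1.0 + odds)


SPAM_THRESHOLD = 0.5  # assumed cut-off: scores at or above this count as spam

# Three typical spam words with assumed probabilities.
usual_spam = [0.9, 0.9, 0.8]
without_good_word = spam_score(usual_spam)

# The same message plus one sure-fire good-mail word (assumed 0.001).
with_good_word = spam_score(usual_spam + [0.001])
```

With these numbers, `without_good_word` is far above the threshold, while `with_good_word` falls below it: a single, sufficiently strong good-mail word outweighs three fairly strong spam words.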

John Graham-Cumming has tried this by letting two Bayesian filters work against each other, the "bad" one adapting to which messages are found to get through the "good" filter. He says it works, though the process is time-consuming and complex. I don't think we will see much of this happening, at least not on a large scale, and not tailored to individuals' email characteristics. Spammers may (try to) figure out some key words for organizations (something like "Almaden" for some people at IBM maybe?) instead.

Spam will usually be (significantly) different from regular mail, though, or it will not be spam.

Executive Summary: Bayesian Filtering's Strength is Its Weakness

Bayesian spam filters are content-based filters that

  • are specifically trained to recognize the individual email user's spam and good mail, making them highly effective and difficult for spammers to adapt to.
  • can adapt to the spammers' latest tricks continually and without much effort or manual analysis.
  • take the individual user's good mail into account and have a very low rate of false positives.

Unfortunately, if this leads to blind trust in Bayesian anti-spam filters, the occasional false positive (good mail classified as spam) becomes even more serious, because it is less likely to be noticed. The opposite error, false negatives (spam that looks exactly like regular mail and gets through), has the potential to disturb and frustrate users.

©2009 About.com, a part of The New York Times Company.

All rights reserved.