Think about how you detect spam. A quick glance is often enough. You know what spam looks like, and you know what good mail looks like. The probability of spam looking like good mail is around... zero.
Scoring Content-Based Filters Do Not Adapt
Wouldn't it be great if automatic spam filters worked like that, too? Scoring content-based spam filters try it. They look for words and other characteristics typical of spam. Every characteristic element is assigned a score, and a spam score for the whole message is computed from the individual scores. Some scoring filters also look for characteristics of legitimate mail, lowering the complete score.
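As a rough illustration of the scoring idea, here is a minimal Python sketch. The word lists, the individual scores, and the threshold are all made up for this example; a real scoring filter would ship a much larger, hand-maintained list.

```python
# Hypothetical, hand-picked scores (assumptions for illustration only).
SPAM_SCORES = {"free": 2.0, "winner": 3.0, "pills": 4.0}
# Characteristics of legitimate mail lower the total score.
GOOD_SCORES = {"meeting": -1.5, "invoice": -1.0}
THRESHOLD = 4.0  # assumed cut-off above which a message counts as spam


def spam_score(message):
    """Sum the fixed scores of every known word in the message."""
    score = 0.0
    for word in message.lower().split():
        score += SPAM_SCORES.get(word, 0.0)
        score += GOOD_SCORES.get(word, 0.0)
    return score


def is_spam(message):
    """Classify a message by comparing its total score to the threshold."""
    return spam_score(message) >= THRESHOLD
```

Note that nothing in this sketch changes unless somebody edits the dictionaries by hand.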
The scoring-filter approach works, but it has several problems.
- The list of characteristics is built from the spam (and the good mail) the filter maker gets. To get a good grasp of the typical spam anybody might get, mail must be collected at hundreds of email addresses. Such a generic list makes the filters less effective, especially because the characteristics of good mail differ from person to person, and this is not taken into account.
- The characteristics to look for are more or less set in stone. If the spammers make the effort to adapt (and make their spam look like good mail to the filters), the filtering characteristics have to be tweaked manually, which is an even bigger effort.
- The score assigned to each word is probably based on a good estimate, but it is still arbitrary. And like the list of characteristics, it adapts neither to the changing world of spam in general nor to an individual user's needs.
Bayesian Spam Filters Tweak Themselves, Getting Better and Better
Bayesian spam filters are a kind of scoring content-based filter, too. But this approach does away with the problems of simple scoring spam filters, and it does so radically. Since the weakness of scoring filters lies in the manually built list of characteristics and their scores, this list is eliminated.
Instead, Bayesian spam filters build the list themselves. Ideally, you start with a (big) bunch of emails that you have classified as spam, and another bunch of good mail. The filters look at both, and analyze the legitimate mail as well as the spam to calculate the probability of various characteristics appearing in spam, and in good mail.
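The training step described here could be sketched as follows. This is a simplified illustration, not any particular filter's implementation: it treats each distinct word as a characteristic, counts the messages in which it occurs, and uses add-one smoothing (an assumption of this sketch) so that a word never seen in one class does not get a probability of exactly zero. The function names `tokenize` and `train` are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Return the set of distinct lowercase words in a message."""
    return set(text.lower().split())


def train(spam_messages, good_messages):
    """Estimate, for every word, how likely it is to appear in spam and in good mail.

    Returns a dict mapping word -> (P(word | spam), P(word | good)).
    """
    spam_counts = Counter()
    good_counts = Counter()
    for message in spam_messages:
        spam_counts.update(tokenize(message))
    for message in good_messages:
        good_counts.update(tokenize(message))

    n_spam = len(spam_messages)
    n_good = len(good_messages)
    probabilities = {}
    for word in set(spam_counts) | set(good_counts):
        # Add-one smoothing: an assumption of this sketch, not part of the article.
        p_spam = (spam_counts[word] + 1) / (n_spam + 2)
        p_good = (good_counts[word] + 1) / (n_good + 2)
        probabilities[word] = (p_spam, p_good)
    return probabilities
```

Because the probabilities come from the user's own classified mail, retraining on new messages updates the list without anybody editing it by hand.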
The characteristics a Bayesian spam filter can look at include:
- the words in the body of the message, of course, and
- its headers (senders and message paths, for example!), but also
- other aspects such as HTML code (like colors), or even
- word pairs, phrases and
- meta information (where a particular phrase appears, for example).
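To make the list above more concrete, here is a small Python sketch of how such characteristics might be turned into tokens. The function name `extract_features` and the prefix scheme (`subject:`, `body:`, `pair:`) are assumptions made for this example; the prefixes are one simple way to record where a word appears. HTML attributes such as colors are left out to keep the sketch short.

```python
def extract_features(headers, body):
    """Turn a message into a set of feature tokens.

    headers: dict mapping header names to values, e.g. {"Subject": "Hello"}
    body:    the plain-text body of the message
    """
    features = set()

    # Header words, prefixed with the header they came from (meta information).
    for name, value in headers.items():
        for word in value.lower().split():
            features.add(f"{name.lower()}:{word}")

    # Single words in the body.
    words = body.lower().split()
    for word in words:
        features.add(f"body:{word}")

    # Adjacent word pairs in the body.
    for first, second in zip(words, words[1:]):
        features.add(f"pair:{first} {second}")

    return features
```

With tokens like these, the same word counts as a different characteristic depending on whether it shows up in the subject line or in the body.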