What Is Bayesian Spam Filtering?

It's a probability-based system that gets better over time


Updated on August 20, 2019

Bayesian Filters Keep Getting Better

Simple word-based spam filters don't account for which words are unusual for a particular email user (one clue that a given message might be spam). Nor can they change the rules they use to identify spam over time. Bayesian spam filters do both.

Bayesian spam filters build up their word lists over time. They analyze both spam and legitimate messages to calculate how likely each characteristic is to appear in spam and in good mail. Words that turn out to be strong spam indicators are then added to the list.

If a word never appears in spam but often appears in the legitimate email you receive, the probability that the word indicates spam is near zero. For example, say you receive many legitimate messages that contain the word Cartesian. That fact decreases the likelihood that messages you receive containing the word Cartesian are spam. On the other hand, say you rarely, if ever, receive legitimate messages that contain the word toner. If you receive a message that does contain the word toner, it's likelier to be spam.
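To make the Cartesian and toner example concrete, here is a minimal Python sketch of one common way to estimate a per-word spam probability from message counts. The function name, the counts, the equal weighting of spam and good mail, and the 0.01 to 0.99 clamp are all illustrative assumptions, not details taken from any particular filter.

```python
def word_spam_probability(spam_with_word, spam_total, ham_with_word, ham_total):
    """Estimate how strongly a word indicates spam (hypothetical helper).

    Assumes spam and legitimate mail ("ham") are weighted equally.
    """
    # Fraction of spam and of legitimate messages that contain the word.
    spam_rate = spam_with_word / spam_total if spam_total else 0.0
    ham_rate = ham_with_word / ham_total if ham_total else 0.0
    if spam_rate + ham_rate == 0:
        return 0.5  # word never seen before: no evidence either way
    probability = spam_rate / (spam_rate + ham_rate)
    # Clamp so that no single word counts as absolute proof (assumed bounds).
    return min(0.99, max(0.01, probability))


# Made-up counts: "Cartesian" appears in 40 of 100 good messages and no spam;
# "toner" appears in 30 of 100 spam messages and 1 of 100 good messages.
cartesian = word_spam_probability(0, 100, 40, 100)
toner = word_spam_probability(30, 100, 1, 100)
print(cartesian, toner)
```

With these made-up counts, Cartesian lands at the lower clamp (near zero) and toner comes out well above 0.9.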

How a Bayesian Filter Examines an Email Message

Message characteristics a Bayesian spam filter looks at include:

Words in the body of the message
Words in the message header (such as the sender and message path)
Other elements such as HTML/CSS code (such as colors and other formatting)
Word pairs and phrases
Meta information (such as where a particular phrase appears)
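As a rough illustration of how characteristics like these could be turned into something a filter can count, the Python sketch below extracts a few of them from a message. The function name, the "subject:", "body:" and "pair:" prefixes, and the tokenizing rule are assumptions made for this example; real filters differ in how they tokenize mail.

```python
import re


def extract_features(subject, body):
    """Turn a message into a set of countable features (hypothetical helper)."""
    # Lowercase and split into simple word tokens (an assumed tokenizing rule).
    subject_words = re.findall(r"[a-z0-9']+", subject.lower())
    body_words = re.findall(r"[a-z0-9']+", body.lower())

    features = set()
    # Prefixes record where a word appeared, a simple form of meta information.
    features.update("subject:" + word for word in subject_words)
    features.update("body:" + word for word in body_words)
    # Adjacent word pairs from the body capture short phrases.
    features.update(
        "pair:" + first + " " + second
        for first, second in zip(body_words, body_words[1:])
    )
    return features


features = extract_features("GREAT DEALS ON TONER!!!!!", "Cheap toner today")
print(sorted(features))
```

Because the same word gets a different feature depending on where it appears, the filter can learn separate probabilities for toner in a subject line and toner in the body.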

When a new message arrives, the Bayesian spam filter analyzes it and calculates the probability of it being spam according to these attributes.

Continuing with the examples above, suppose a message contains both words, Cartesian and toner. From these words alone, it's not clear whether the message is spam or legitimate email. But if the message also has the subject line "GREAT DEALS ON TONER!!!!!" then the likelihood of it being spam increases.
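The way individual clues add up to one verdict can be sketched with the usual naive Bayes combination rule, which treats each clue as independent. The Python example below is a sketch under that assumption; the function name and the probabilities 0.1, 0.9 and 0.95 are invented for illustration, and every input must lie strictly between 0 and 1.

```python
import math


def combine(probabilities):
    """Combine per-feature spam probabilities into one message score.

    Naive Bayes combination with equal priors, computed in log space:
    score = prod(p) / (prod(p) + prod(1 - p)).
    Each probability must be strictly between 0 and 1.
    """
    log_spam = sum(math.log(p) for p in probabilities)
    log_ham = sum(math.log(1.0 - p) for p in probabilities)
    # Cap the exponent so math.exp cannot overflow on very long messages.
    return 1.0 / (1.0 + math.exp(min(log_ham - log_spam, 700.0)))


# Assumed values: Cartesian = 0.1 (good-mail word), toner = 0.9 (spam word).
words_only = combine([0.1, 0.9])
# Add an assumed 0.95 for the shouting "GREAT DEALS ON TONER!!!!!" line.
with_shouting_line = combine([0.1, 0.9, 0.95])
print(words_only, with_shouting_line)
```

With these assumed values, the two words cancel out to an undecided score of about 0.5, and the extra clue pushes the score to roughly 0.95.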

Bayesian Filters Automatically Learn

Following the classification into "spam" or "legitimate email," the filter can use that determination to further train itself. In our example, the filter must either lower the probability of Cartesian indicating good mail or raise the probability of toner indicating spam. Given the additional data of the spammy header on this message (and perhaps other factors as well), it would do the latter and evaluate the next incoming message based on the new probability.
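The retraining step can be sketched as nothing more than updating counts and recomputing probabilities from them. The Python class below, its name, and the sample messages are hypothetical and only illustrate the idea. In this simple sketch every word in a newly classified message is updated, so Cartesian's probability shifts as well as toner's.

```python
from collections import Counter


class WordCounts:
    """Per-word message counts that a filter could retrain on (hypothetical)."""

    def __init__(self):
        self.spam_messages = 0
        self.ham_messages = 0
        self.spam_counts = Counter()
        self.ham_counts = Counter()

    def train(self, words, is_spam):
        # Count each word once per message, in the matching category.
        if is_spam:
            self.spam_messages += 1
            self.spam_counts.update(set(words))
        else:
            self.ham_messages += 1
            self.ham_counts.update(set(words))

    def spam_probability(self, word):
        spam_rate = self.spam_counts[word] / self.spam_messages if self.spam_messages else 0.0
        ham_rate = self.ham_counts[word] / self.ham_messages if self.ham_messages else 0.0
        if spam_rate + ham_rate == 0:
            return 0.5  # unseen word: no evidence either way
        return spam_rate / (spam_rate + ham_rate)


counts = WordCounts()
counts.train(["cartesian", "coordinates"], is_spam=False)
counts.train(["toner", "invoice"], is_spam=False)
counts.train(["toner", "deals"], is_spam=True)
counts.train(["deals", "winner"], is_spam=True)

before = counts.spam_probability("toner")
# The filter classified the mixed message as spam and now learns from it.
counts.train(["cartesian", "toner", "deals"], is_spam=True)
after = counts.spam_probability("toner")
print(before, after)
```

After the new message is added to the spam counts, the probability that toner indicates spam is higher than before, so the next incoming message is scored against the updated numbers.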

Using this auto-adaptive technique, Bayesian filters can learn both from their own decisions and from users' decisions, when users manually correct wrongly evaluated messages. This adaptability makes the filters especially effective for individual email users because, while most people's spam has similar characteristics, legitimate mail differs from person to person.

Can Spammers Get Past Bayesian Filters?

The characteristics of legitimate email are just as important for the Bayesian spam filtering process as the characteristics of spam are. Because the filters are trained specifically for each user, spammers have a harder time working around them, and the filters can adapt to almost everything spammers try.

Spammers' messages make it past well-trained Bayesian filters only if the spammers make their spam look like perfectly ordinary email. But spammers don't usually send such ordinary messages because they don't serve the spammers' purposes well (that is, convincing you to buy something or click a link).

As good as a Bayesian filter might be, one word or characteristic that frequently appears in good mail can be so significant that it prevents a message containing it from being rated as spam. Therefore, if spammers could find a way to determine your sure-fire good-mail words, they could include one of them in a junk message and reach you even through a well-trained Bayesian filter. But, according to researchers who have tried this method, it's time-consuming and complex enough that it's not likely to be used very frequently.
