April 2006 Issue

Inside Current Issue: Product of the Month

GFI MailEssentials
GFI MailEssentials for Exchange/SMTP offers phishing and spam protection at the server level and eliminates the need to install and update anti-phishing and anti-spam software on each desktop. It offers fast setup with no configuration required, a high spam detection rate using Bayesian analysis and other methods, very low false positives through its automatic whitelist, and the ability to adapt automatically to your email environment so that spam detection is continually tuned and improved. It also enables you to sort spam into users' junk email folders. GFI MailEssentials also adds key email tools to your mail server: disclaimers, reporting, email archiving and monitoring, server-based auto replies, anti-phishing, and POP3 downloading.

How a Bayesian spam filter works
Bayesian filtering is based on the principle that most events are dependent and that the probability of an event occurring in the future can be inferred from the previous occurrences of that event.

This same technique can be used to classify spam. If some piece of text occurs often in spam but not in legitimate email, then it is reasonable to assume that an email containing that text is probably spam.

Before email can be filtered using this method, the user needs to generate a database with words and tokens (such as the $ sign, IP addresses and domains, and so on), collected from a sample of spam email and valid email (referred to as ‘ham’).

Creating a word database for the filter

A probability value is then assigned to each word or token; the probability is based on calculations that take into account how often that word occurs in spam as opposed to legitimate email. This is done by analyzing users’ outbound email and known spam: All the words and tokens in both pools of email are analyzed to generate the probability that a particular word points to the email being spam.
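As a rough illustration of the database step described above, the following Python sketch counts how many messages in a pool contain each token. The token pattern and the names `tokenize` and `build_database` are assumptions made for this example; the article does not describe how GFI MailEssentials tokenizes email.

```python
import re
from collections import Counter

# Hypothetical token pattern: letters, digits, '$', dots, apostrophes and
# hyphens, so that "$$$", "10.0.0.1" and "f-r-e-e" each survive as one token.
TOKEN_RE = re.compile(r"[A-Za-z0-9$.'-]+")


def tokenize(text):
    """Return the set of lowercase tokens found in a message."""
    tokens = set()
    for raw in TOKEN_RE.findall(text):
        tok = raw.lower().strip(".'-")  # drop sentence punctuation at the edges
        if tok:
            tokens.add(tok)
    return tokens


def build_database(messages):
    """Count, for each token, the number of messages that contain it."""
    counts = Counter()
    for msg in messages:
        counts.update(tokenize(msg))
    return counts


spam_db = build_database(["Cheap mortgage rates, act now", "mortgage offer $$$"])
ham_db = build_database(["Minutes from Monday's meeting", "Visit 10.0.0.1 today."])
```

Counting messages rather than raw occurrences matches the way the "mortgage" example in this article is phrased (a word occurring in some number of emails).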

This word probability is calculated as follows: If the word "mortgage" occurs in 400 of 3,000 spam emails and in 5 out of 300 legitimate emails, for example, then its spam probability would be 0.8889 (that is, [400/3000] divided by [5/300 + 400/3000]).
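The calculation above can be checked with a few lines of Python. This is a minimal sketch of the formula as stated in the text; the function name and the choice to treat a never-seen word as neutral (0.5) are assumptions for illustration.

```python
def word_spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Spam probability of a word from its frequency in spam and in ham."""
    spam_freq = spam_hits / spam_total
    ham_freq = ham_hits / ham_total
    if spam_freq + ham_freq == 0:
        return 0.5  # assumption: a word seen in neither pool is neutral
    return spam_freq / (spam_freq + ham_freq)


# "mortgage": 400 of 3,000 spam emails, 5 of 300 legitimate emails
p = word_spam_probability(400, 3000, 5, 300)
print(round(p, 4))  # 0.8889
```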

It is important to note that the analysis of ham email is performed on the organization's email, and is therefore tailored to that particular organization. For example, a financial institution might use the word "mortgage" many times over and would get a lot of false positives if using a general anti-spam rule set. On the other hand, the Bayesian filter, if tailored to your company through an initial training period, takes note of the company's valid outbound email (and recognizes "mortgage" as being frequently used in legitimate messages), and therefore has a much better spam detection rate and a far lower false positive rate.

Besides ham email, the Bayesian filter also relies on a spam data file. This spam data file must include a large sample of known spam and must be constantly updated with the latest spam by the anti-spam software. This will ensure that the Bayesian filter is aware of the latest spam tricks, resulting in a high spam detection rate (note: this is achieved once the required initial two-week learning period is over).

Once the ham and spam databases have been created, the word probabilities can be calculated and the filter is ready for use. When a new email arrives, it is broken down into words, and the most relevant words (those that are most significant in identifying whether the email is spam or not) are singled out. From these words, the Bayesian filter calculates the probability of the new message being spam or not. If the probability is greater than a threshold, say 0.9, then the message is classified as spam.
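The article does not say how the individual word probabilities are combined into a message score. One commonly described approach, shown here purely as an assumption, multiplies the probabilities of the most significant words and normalizes against the product of their complements. The 15-word limit, the 0.4 value for unknown words, the clamping range and the sample probabilities are all illustrative choices, not GFI's settings.

```python
from math import prod


def classify(tokens, word_probs, threshold=0.9, max_words=15, unknown=0.4):
    """Return (score, is_spam) for a tokenized message."""
    # Clamp so a single 0.0 or 1.0 cannot zero out both products.
    probs = [min(0.99, max(0.01, word_probs.get(t, unknown))) for t in set(tokens)]
    # Most relevant words first: those farthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:max_words]
    spam_side = prod(top)
    ham_side = prod(1.0 - p for p in top)
    score = spam_side / (spam_side + ham_side)
    return score, score > threshold


# Hypothetical word probabilities for illustration only.
word_probs = {"mortgage": 0.8889, "free": 0.95, "meeting": 0.05, "public": 0.5}
print(classify(["free", "mortgage"], word_probs))
print(classify(["meeting", "public"], word_probs))
```

With these sample values the first message scores well above the 0.9 threshold and the second well below it.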

Why Bayesian filtering is better
The Bayesian method takes the whole message into account – It recognizes keywords that identify spam, but it also recognizes words that denote valid email. For example, not every email that contains the words "free" and "cash" is spam. The advantage of the Bayesian method is that it considers the most interesting words (as defined by their deviation from the mean) and comes up with a probability that a message is spam. Bayesian filtering is a much more intelligent approach because it examines all aspects of a message, as opposed to keyword checking, which classifies an email as spam on the basis of a single word.

A Bayesian filter is constantly self-adapting – By learning from new spam and new valid outbound emails, the Bayesian filter evolves and adapts to new spam techniques. For example, when spammers started using "f-r-e-e" instead of "free," they succeeded in evading keyword checking until "f-r-e-e" was also included in the keyword database. The Bayesian filter, on the other hand, notices such tactics automatically; in fact, if the word "f-r-e-e" is found, it is an even better spam indicator, since it’s unlikely to occur in a ham email.

The Bayesian technique is sensitive to the user – It learns the email habits of the company and understands that, for example, the word “mortgage” might indicate spam if the company running the filter is, say, a car dealership, whereas it would not indicate spam if the company is a financial institution dealing with mortgages.
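To make the car dealership versus financial institution contrast concrete, the sketch below applies the word-probability formula given earlier in this article to two made-up ham pools. All of the counts are hypothetical; only the formula comes from the text.

```python
def word_spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Spam probability of a word, using the formula given in the article."""
    spam_freq = spam_hits / spam_total
    ham_freq = ham_hits / ham_total
    return spam_freq / (spam_freq + ham_freq)


# Same spam pool for both companies: "mortgage" in 400 of 3,000 spam emails.
# Hypothetical ham pools: the dealership almost never writes "mortgage",
# while the financial institution uses it in most of its outbound email.
dealership = word_spam_probability(400, 3000, 1, 1000)
bank = word_spam_probability(400, 3000, 600, 1000)

print(f"car dealership:        {dealership:.2f}")  # 0.99
print(f"financial institution: {bank:.2f}")        # 0.18
```

Under these assumed counts the same word is a strong spam signal for one company and a mild ham signal for the other, which is the behavior the paragraph above describes.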

The Bayesian method is multi-lingual and international – A Bayesian anti-spam filter, being adaptive, can be used for any language required. Most keyword lists are available in English only and are therefore quite useless in non-English-speaking regions. The Bayesian filter also takes into account language variations and the diverse usage of certain words in different areas, even where the same language is spoken.

A Bayesian filter is difficult to fool, as opposed to a keyword filter – An advanced spammer who wants to trick a Bayesian filter can either use fewer words that usually indicate spam (such as free, Viagra, etc.), or more words that generally indicate valid email (such as a valid contact name). Doing the latter is impossible because the spammer would have to know the email profile of each recipient – and a spammer can never hope to gather this kind of information from every intended recipient. Using neutral words, for example the word "public," would not work since these are disregarded in the final analysis.
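The point about neutral words can be illustrated with a short Python sketch. It assumes a scoring rule that keeps only the words farthest from the neutral probability 0.5 and combines them by multiplying probabilities; the article does not specify the rule, so the function, the 15-word limit and the sample probabilities are assumptions.

```python
from math import prod


def spam_score(probs, max_words=15):
    """Combine the most significant word probabilities into one score."""
    # Keep the words that deviate most from the neutral value 0.5.
    top = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:max_words]
    spam_side = prod(top)
    ham_side = prod(1.0 - p for p in top)
    return spam_side / (spam_side + ham_side)


# Hypothetical message with two spammy words (0.95 and 0.90).
base = [0.95, 0.90]
# The same message padded with 50 neutral words such as "public" (0.5).
padded = base + [0.5] * 50

print(spam_score(base), spam_score(padded))
```

Under this rule both messages receive the same score: neutral words are either dropped by the selection step or contribute equally to both sides of the ratio, so padding a message with them gains the spammer nothing.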

What’s the catch?
Bayesian filtering, if implemented the right way and tailored to your company, is by far the most effective technology to combat spam. Is there a downside? In a way there is one, but it can easily be overcome: Before you can use and judge the Bayesian filter, you have to wait at least two weeks for it to learn, or else create the ham and spam databases yourself. That task can be quite complex, so it is best to wait until the filter has had time to learn. Over time, the Bayesian filter becomes more effective as it learns about your organization’s email habits. To quote the old saying, “good things come to those who wait.”




© IMPIRE Communications, LLC All Rights Reserved.  
