September 16, 2010

What Happened to Spam?

Spam used to be a terrible problem. I remember going through my inbox, and deleting 80%+ of my messages. That was just part of checking your email. There was hand-wringing about the death of email, and public outcry, resulting in the ineffectual CAN-SPAM Act.

Today, it is rare for more than one spam email per week to make it through to my inbox. In addition, I can't remember the last time that an email I wanted was mistakenly sent to my spam folder (a common problem with some early filtering systems).

So, what happened? Did spammers thoughtfully consider their behavior, and decide to change their ways? Did governments crack down on them, scaring them into better behavior? Nope. In reality, the amount of spam sent has continued to increase. It is our filters that have gotten better.

For a long time, the battle between the spammers and the spam filters was fairly equal. Programmers would come up with new ways to thwart spammers, and the spammers would figure out a trick to get around those tools. For example, filter software started to look for terms like "herbal Viagra", and would mark any email containing that term as spam. In response, spammers would use an image for the word "Viagra", or would spell it using a similar-looking Unicode character.
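To make that cat-and-mouse game concrete, here is a minimal Python sketch of a keyword filter and the look-alike character trick that defeats it. The term list and the function name keyword_filter are made up for illustration; they are not taken from any real filtering product.

```python
# Hypothetical blocklist; a real filter would have had many more terms.
BLOCKED_TERMS = ["herbal viagra"]

def keyword_filter(text):
    """Return True if the text contains any blocked term (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

plain = "Buy herbal Viagra today"
# Same message, but the Latin "i" is swapped for U+0456, a Cyrillic letter
# that looks nearly identical on screen.
disguised = "Buy herbal V\u0456agra today"

print(keyword_filter(plain))      # True
print(keyword_filter(disguised))  # False
```

The two messages look the same to a person, but the second one sails past the filter until a programmer notices the trick and adds another rule.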

While many of these techniques are still valuable, the real breakthrough came when spam filters started using Bayesian classification to identify spam. Where other techniques rely on clever programmers figuring out new tricks, Bayesian filters do not need anyone to write their rules by hand.

Basically, a Bayesian classifier is fed a whole bunch of email messages, each labeled as spam or not spam. From these examples it works out its own rules about how to classify messages, in the form of statistics on which features show up more often in spam, and then uses them to determine whether or not incoming emails are spam.
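Here is a rough sketch of that idea in Python: a tiny naive Bayes classifier that only looks at words. It is a simplified illustration, not how any particular mail provider does it. The class name, the "spam"/"ham" labels, the whitespace tokenizer, and the add-one smoothing are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter

class NaiveBayesSpamFilter:
    """Toy word-based naive Bayes classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one labeled message; label is 'spam' or 'ham'."""
        self.message_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def spam_probability(self, text):
        """Estimate P(spam | words in text) using Bayes' rule."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Prior: how common this class is overall (add-one smoothed).
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                # Likelihood of each word given the class (add-one smoothed,
                # so unseen words never produce a zero probability).
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocabulary) + 1))
            log_scores[label] = score
        # Convert the two log scores back into a probability.
        highest = max(log_scores.values())
        spam = math.exp(log_scores["spam"] - highest)
        ham = math.exp(log_scores["ham"] - highest)
        return spam / (spam + ham)

spam_filter = NaiveBayesSpamFilter()
spam_filter.train("cheap herbal pills buy now", "spam")
spam_filter.train("buy cheap pills online now", "spam")
spam_filter.train("lunch meeting tomorrow at noon", "ham")
spam_filter.train("notes from the meeting attached", "ham")

print(spam_filter.spam_probability("cheap herbal pills"))      # close to 1
print(spam_filter.spam_probability("lunch meeting tomorrow"))  # close to 0
```

Notice that nobody told the program that "pills" is suspicious. It picked that up from the labeled examples, and it would pick up different patterns from different examples.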

This seems like a crazy way to do things - surely a set of rules written by experts would be much more effective? But Bayesian filters work better for a few reasons:

1. The rules that they create can be incredibly subtle and would never be noticed by humans. For example, maybe there is a rule that emails from a certain country that have capitalized words in the header are usually spam. A human would never be able to discern that pattern.

2. Their rules are almost impossible to reverse-engineer. Because they are so subtle and complex, spammers cannot figure out why their messages are blocked.

3. They can be user-specific. A personal classifier can be layered on top of a general filter, so that messages that contain your spouse's name, for example, are almost never marked as spam.

4. They can learn. Bayesian filters become better as users give feedback - moving messages to or from the spam folder. In addition, they automatically adjust to any new techniques that spammers use. Spam that gets through is quickly marked as spam by users, and the filters learn how to identify it.
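Points 3 and 4 can be sketched together in Python. What follows is one possible way to layer a personal classifier on a general one and update it from user feedback, not a description of any real mail system. The names Layer, LayeredFilter, and PERSONAL_WEIGHT are hypothetical, the weight of 3.0 is an arbitrary choice, and adding the two layers' scores together is a simplification made for the example.

```python
import math
from collections import Counter

class Layer:
    """Per-word spam/ham statistics for one layer (general or personal)."""

    def __init__(self):
        self.messages = {"spam": 0, "ham": 0}
        # For each label: how many messages contained each word.
        self.containing = {"spam": Counter(), "ham": Counter()}

    def learn(self, text, label):
        self.messages[label] += 1
        self.containing[label].update(set(text.lower().split()))

    def log_odds(self, text):
        """Positive means spam-like, negative means ham-like."""
        score = 0.0
        for word in set(text.lower().split()):
            in_spam = self.containing["spam"][word]
            in_ham = self.containing["ham"][word]
            if in_spam + in_ham == 0:
                continue  # this layer has never seen the word: no opinion
            p_spam = (in_spam + 1) / (self.messages["spam"] + 2)
            p_ham = (in_ham + 1) / (self.messages["ham"] + 2)
            score += math.log(p_spam / p_ham)
        return score

# Assumption: the user's own feedback counts more than the general data.
PERSONAL_WEIGHT = 3.0

class LayeredFilter:
    def __init__(self, general):
        self.general = general    # shared, trained on everyone's mail
        self.personal = Layer()   # starts empty, learns from one user

    def is_spam(self, text):
        total = (self.general.log_odds(text)
                 + PERSONAL_WEIGHT * self.personal.log_odds(text))
        return total > 0

    def feedback(self, text, label):
        """Called when the user moves a message to or from the spam folder."""
        self.personal.learn(text, label)

general = Layer()
general.learn("cheap pills offer", "spam")
general.learn("meeting notes attached", "ham")
mail = LayeredFilter(general)

message = "alice cheap offer"
print(mail.is_spam(message))    # True: the general layer dislikes "cheap" and "offer"
mail.feedback(message, "ham")   # user drags it back to the inbox
print(mail.is_spam(message))    # False: the personal layer now outweighs it
```

The same feedback method handles the opposite case: a new kind of spam that slips through gets marked by the user, and the personal layer starts catching it without anyone writing a new rule.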

The fact that Bayesian filters are our best solution to spam is incredible, and a little unnerving. We have taught our computers to be smarter than us. The best programmer cannot write a program that filters email as effectively as a Bayesian filter. It is one thing to marvel at a computer's processing speed - multiplying huge numbers or computing the 1,000,000th digit of pi. But it is another thing to realize that computers can now create better solutions to some problems than we can.