September 16, 2010

What Happened to Spam?

Spam used to be a terrible problem. I remember going through my inbox, and deleting 80%+ of my messages. That was just part of checking your email. There was hand-wringing about the death of email, and public outcry, resulting in the ineffectual CAN-SPAM Act.

Today, it is rare if more than one spam email per week makes it through to my inbox. In addition, I can't remember the last time that an email that I wanted was inadvertently sent to my spam folder (a common problem with some early filtering systems).

So, what happened? Did spammers thoughtfully consider their behavior, and decide to change their ways? Did governments crack down on them, scaring them into better behavior? Nope. In reality, the amount of spam sent has continued to increase. It is our filters that have gotten better.

For a long time, the battle between the spammers and the spam filters was fairly equal. Programmers would come up with new ways to thwart spammers, and the spammers would figure out a trick to get around those tools. For example, filter software started to look for terms like "herbal Viagra", and would mark any email with that term as spam. Spammers would respond by using an image for the word "Viagra", or by spelling it with a similar-looking Unicode character.

While many of these techniques are still valuable, the real breakthrough came when spam filters started using Bayesian classification to identify spam. Where other techniques rely on clever programmers figuring out new tricks, Bayesian filters need no programmer to write their rules.

Basically, a Bayesian classifier is fed a whole bunch of email messages, together with a label saying whether or not each message is spam. It creates its own rules about how to classify messages, and then uses these to determine whether or not incoming emails are spam.
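To make that concrete, here is a minimal sketch of the idea in Python. It assumes a word-based naive Bayes model with add-one smoothing, which is only one of several ways to build such a filter, and it is a toy rather than a description of any real mail provider's system. The class name, method names, and sample messages are all made up for illustration.

```python
import math
import re
from collections import Counter


class NaiveBayesSpamFilter:
    """Toy word-based naive Bayes filter (hypothetical, for illustration)."""

    LABELS = ("spam", "ham")  # "ham" means legitimate mail

    def __init__(self):
        # How often each word appeared in messages of each label.
        self.word_counts = {label: Counter() for label in self.LABELS}
        # How many messages of each label have been seen.
        self.message_counts = {label: 0 for label in self.LABELS}

    @staticmethod
    def tokenize(text):
        # Lowercase the text and split it into simple word tokens.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        # "Learning" is just counting words per label.
        self.message_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def spam_probability(self, text):
        words = self.tokenize(text)
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())

        log_scores = {}
        for label in self.LABELS:
            # Prior: how common this label is overall (add-one smoothed).
            prior = (self.message_counts[label] + 1) / (total_messages + 2)
            score = math.log(prior)
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + len(vocabulary) + 1
            for word in words:
                # Likelihood of the word under this label; smoothing keeps
                # unseen words from driving the probability to zero.
                score += math.log((self.word_counts[label][word] + 1) / denominator)
            log_scores[label] = score

        # Turn the two log scores into a probability between 0 and 1.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


spam_filter = NaiveBayesSpamFilter()
spam_filter.train("cheap herbal pills buy now", "spam")
spam_filter.train("buy cheap pills online now", "spam")
spam_filter.train("lunch meeting tomorrow at noon", "ham")
spam_filter.train("notes from the meeting attached", "ham")

print(spam_filter.spam_probability("cheap herbal pills"))
print(spam_filter.spam_probability("lunch meeting tomorrow"))
```

Nothing in this sketch mentions pills or meetings as a rule. The "rules" are just word counts, and calling `train` again whenever a user moves a message into or out of the spam folder is enough to shift them.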

This seems like a crazy way to do things - you would think that a set of hand-written rules would be much more effective. But Bayesian filters work better for a few reasons:

1. The rules that they create can be incredibly subtle and would never be noticed by humans. For example, maybe there is a rule that emails from a certain country that have capitalized words in the header are usually spam. A human would never be able to discern that pattern.

2. Their rules are almost impossible to reverse-engineer. Because they are so subtle and complex, spammers cannot figure out why their messages are blocked.

3. They can be user-specific. A personal classifier can be layered on top of a general filter, so that messages that contain your spouse's name, for example, are almost never marked as spam.

4. They can learn. Bayesian filters become better as users give feedback - moving messages to or from the spam folder. In addition, they automatically adjust to any new techniques that spammers use. Spam that gets through is quickly marked as spam by users, and the filters learn how to identify it.

The fact that Bayesian filters are our best solution to spam is incredible, and a little unnerving. We have taught our computers to be smarter than us. The best programmer cannot write a program that filters email as effectively as a Bayesian filter. It is one thing to be outdone by a computer's processing speed - multiplying huge numbers or computing the 1,000,000th digit of pi. But it is another thing to realize that computers can now create better solutions to some problems than we can.

August 14, 2010

An Opposition to Net Neutrality

Net neutrality has been in the news lately. Basically, net neutrality supporters want the FCC to make it illegal for Internet providers to privilege certain types of Internet traffic over others. For example, Google couldn't pay Comcast to make sure that YouTube videos come through in higher quality than other videos on the web.

On paper, this sounds great. The web, as net neutrality supporters are wont to say, was built on equality, and we shouldn't introduce inequalities. I actually agree that a neutral web is, all things being equal, better.

However, there are some problems with enforcing this equality. I'm going to gloss over the issue that I find most salient, because I also think it is the most obvious. Namely, the question of whether or not it is the government's place to regulate how ISPs operate. They own the networks that the Internet runs on - they paid to lay the fiber, and I think that they have an inherent right to operate it how they choose.

That issue aside, I think that the potentially bigger issue is less obvious. Government regulation so often fails because it focuses on improving the present. Currently, there are not many companies that own much of the Internet infrastructure, and it makes good sense to ensure that they don't do things which are bad for the Internet ecosystem - an ecosystem which has become more and more important to all aspects of our society. Regulation which stops them from ruining the Internet is regulation which seems to help everyone.

The only people that it doesn't help are our future selves. By enforcing net neutrality, we stop Internet providers from doing innovative things, and thereby make laying more Internet cables less lucrative. If Google were willing to pay billions of dollars to prioritize its traffic, you can be sure that ISPs would be laying cable all over the country to make sure that they were able to meet that contract. And that cable would help everyone - not just Google. Sure, maybe Google would get the lion's share of the benefit, but there would be "trickle-down bandwidth" for everyone.

By letting people decide with their wallets what their real priorities are, you allow companies to do innovative things, and improve the experience for everyone. Technology is magically improving our lives all the time. If we regulate to ensure the status quo, then we will get the status quo. We will miss out on the future that could have been, and the future is (almost) always better.

April 4, 2010

The Power of Ignorance

200 years ago, the majority of Americans were rural farmers, living a very different life than the average American today. They grew their own food, made their own clothes, and cooked their own meals.

This self-reliance extended even further. Every single thing that our rural ancestors owned was either built by them, or by someone nearby. A community of a few dozen people could create every single item owned by every single member of the community. From horseshoes to plows to dresses to seeds, all of the requisite knowledge, resources, and skills were provided by the members of the community.

We live in a very different world. Looking around my modest apartment, I found it very difficult to find items that I, or anyone I know, could produce. I have no idea how to make even common items like plastic bags, shoes, and cloth. I was washing my hands, and realized that I have no idea how the soap dispenser works.

The urbanization of the world has allowed us to increase our knowledge specialization, and decrease our knowledge density, which I will define as the proportion of the population that needs to know how to make or do a certain thing. Living closer together meant that it became more efficient for a baker to make your bread than for you to make it yourself. Because bakers could spend so much of their time making bread, they soon learned more about bread-making than anyone had known before. The Internet continues to decrease our knowledge density, as knowledge can be stored and shared across the world, and across time.

This specialization of knowledge has led to a situation that is unique in human history - products which no individual understands completely. Computer software is a prime example. While there are many people who understand the basic concepts behind how a computer works, there are so many pieces layered on top of and integrated with one another, that understanding them all is impossible. Some people understand how to write a program, others understand how to compile it, others understand how the compiled program interacts with the operating system, others understand how the operating system interacts with the hardware, but I think that it would be fair to say that no one in the world has the knowledge to recreate the entire system. No one knows how a computer works.

At first, this concept is scary. While our collective knowledge has increased many times over, our individual knowledge of basic self-sufficiency has dropped dramatically. We do not know how to create even the things that we use every day. Stranded in the wilderness, we not only could not recreate the tools of today, but we couldn't even create the tools of our ancestors. We would be in much worse shape than the farmer in our introduction.

However, I think that this is a price worth paying. By specializing our knowledge, we have been able to learn and know and do things that our ancestors couldn't even dream of. We no longer need to know how to do everything for ourselves, and that excess of time, energy, and mental capacity has allowed us to expand the borders of our societal knowledge and ability orders of magnitude faster than at any point in human history. At the same time, we have become much better at providing for our basic needs.

All of these miracles have occurred because we have embraced ignorance, and been willing not to understand.