I admit, I'm a bit perplexed as to why this paper got slashdotted now; it's pretty old, representing research I did mostly in 2003. My other publications can be viewed from my university website (which is much better equipped for a slashdotting than this server is). These papers answer a lot of the common questions that weren't answered in this article due to space constraints and simple lack of interest for the general audience of the magazine.
Here are answers to some of the common questions, though:
How are we really going to stop spam?
The tongue-in-cheek answer I give to kick off my conference presentations:
- Spam exists because spammers make money.
- Spammers make money because people buy their products.
- Only stupid people buy stuff from spam.
Therefore, the way to stop spam is to keep stupid people away from email!
And yes, I realise there are flaws with this solution too. But you have to admit that the economic logic is compelling! :)
Seriously, we're going to need lots of different solutions to be in place, because what we really need to do is make it hard for spammers to get their mail to enough people for them to make money. We don't want a monoculture here -- we want lots of diverse solutions so one carefully-crafted mail still doesn't hit everyone.
Why an immune system?
It's a compelling metaphor for various reasons, including the inherent diversity of immune systems. But what I'd really hoped to do was look at the effect of mutation on the heuristics. The human immune system has this fascinating process called hypermutation that takes an "okay" detector and tries to make it into a better one through mutation and testing. This paper does not include this mutation work, and just shows that the metaphor lends itself to reasonable classification rates. Sadly, I have not yet been able to do work with hypermutation in spam heuristics, although I recommend you look at Andy Secker's work in AIS email classification for a hint of what can be done.
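To give a feel for the hypermutation idea described above, here's a minimal sketch in Python. Everything in it is my illustrative naming, not code from the paper: a "detector" is just a set of substring genes, its fitness is spam caught minus ham wrongly caught on a labeled sample, and a greedy loop keeps a random mutation only when it scores better.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def score(genes, spam, ham):
    """Fraction of spam caught minus fraction of ham wrongly caught."""
    hits = lambda msgs: sum(any(g in m for g in genes) for m in msgs)
    return hits(spam) / len(spam) - hits(ham) / len(ham)

def mutate(genes):
    """Randomly tweak one gene: swap, add, or drop a character."""
    genes = list(genes)
    i = random.randrange(len(genes))
    g = list(genes[i])
    op = random.choice(["swap", "add", "drop"])
    if op == "swap" or len(g) <= 2:      # never shrink a gene below 2 chars
        g[random.randrange(len(g))] = random.choice(ALPHABET)
    elif op == "add":
        g.insert(random.randrange(len(g)), random.choice(ALPHABET))
    else:
        g.pop(random.randrange(len(g)))
    genes[i] = "".join(g)
    return genes

def hypermutate(genes, spam, ham, rounds=200):
    """Greedily keep mutations that improve an 'okay' detector."""
    best, best_score = genes, score(genes, spam, ham)
    for _ in range(rounds):
        cand = mutate(best)
        s = score(cand, spam, ham)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score

# Toy corpora for illustration only:
spam = ["buy viagra now", "cheap viagra here", "viagra sale today"]
ham = ["meeting at noon", "lunch tomorrow", "project status update"]
better, s = hypermutate(["vigra"], spam, ham)
```

The real immune process is far richer than this greedy loop, of course, but it captures the "take an okay detector and test mutations of it" shape of the idea.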
How is this biological?
"Inspired by..." -- A colleague of mine once commented that if you can't explain a computer science solution in terms of math, then you're probably just handwaving. I've attempted to show everything in terms of algorithms and math, but the immune system is what got me thinking about doing things in this way, and it lends itself to some interesting new ideas in the subject, such as mutation, usage of "slowly" evolved gene libraries in conjunction with the faster weighting process, and so on.
How is this different from a Bayesian classifier?
This system can easily be reduced to a Bayesian classifier: flip the right flags for that weighting, reduce the number of genes per lymphocyte to 1, and use genes that are exactly what a Bayesian tokenizer produces.
What makes it different is that it can use much more complex gene fragments. A human immune system, when producing new antibodies, doesn't go back to the 1 and 0 equivalents in DNA. The units used are fairly complex, and represent years of evolution -- your parents helped define the set that you've got to start with. The idea here is that while Bayesian solutions are great, they're known to be beatable in some ways, and this sort of system would allow people to use a bit more prior knowledge to kickstart their systems, if they had heuristics available. Preliminary work suggests that heuristics can be used to reduce false positives (legit emails classified as spam).
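To make the contrast concrete, here's a hedged sketch (the names and the wildcard-joining scheme are mine, chosen for illustration): a Bayesian-style feature keys on a single token, while a lymphocyte built from several gene fragments can still fire when spammers obfuscate the token itself.

```python
import re

def make_antibody(fragments):
    """Join gene fragments with '.*' so they must appear in order."""
    return re.compile(".*".join(re.escape(f) for f in fragments), re.IGNORECASE)

bayesian_token = make_antibody(["viagra"])           # one gene: a plain token
lymphocyte = make_antibody(["v", "agra", "only $"])  # several fragments, in order

msg = "Get V1agra for only $9.99!"
print(bool(bayesian_token.search(msg)))  # False: obfuscation beats the token
print(bool(lymphocyte.search(msg)))      # True: fragments still match in order
```

The point isn't that this particular pattern is clever -- it's that richer building blocks let prior knowledge (good heuristics) survive the kind of tokenizer-level tricks that plain Bayesian features are known to be vulnerable to.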
I admit, I thought it might just be fun to make heuristics and see how they did in the "competition" to produce spam. If you came up with really good heuristics, you'd be able to "vaccinate" other people's systems by providing actual antibodies, or even by providing the gene fragments so they could produce their own combinations. Sure, it's not very true to the biology, but it might be interesting.
At the time, I don't think Bayesian filters did time-decay, but to be honest, I don't remember. The whole learning-from-classification-in-steps idea (e.g., if I say a message has an 80% chance of being spam, I update my lymphocytes in relation to that, rather than updating them on the assumption that it really is spam) is probably also novel.
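A minimal sketch of that proportional-update idea, with time decay thrown in (the parameter names and the specific update rule are my own, purely illustrative): each matching lymphocyte's weight shifts by an amount scaled by the classifier's own confidence, rather than by a hard spam/ham label, and all weights fade slowly so stale evidence loses influence.

```python
def update_weights(weights, matched, p_spam, rate=0.5, decay=0.99):
    """weights: dict of lymphocyte id -> weight; matched: ids that fired.

    p_spam is the classifier's own confidence that the message was spam,
    so a 50/50 verdict (p_spam=0.5) moves nothing at all.
    """
    for lid in weights:
        weights[lid] *= decay                # time decay: old evidence fades
    for lid in matched:
        # Shift toward +1 for spam-ish verdicts, -1 for ham-ish ones,
        # scaled by how confident the classification actually was.
        weights[lid] += rate * (2 * p_spam - 1)
    return weights

w = {"a": 1.0, "b": 0.0}
update_weights(w, matched=["a"], p_spam=0.8)  # 80% sure it's spam
# "a" gains 0.5 * 0.6 = 0.3 on top of its decayed weight; "b" only decays
```

Treating the 80% verdict as 80% of an update, instead of rounding it to "definitely spam," is exactly the stepwise learning described above.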
Why the SA corpus?
It was the only halfway decent one available at the time I was doing this research. I do have results for my personal mail (they're mostly in an earlier paper), but reviewers on that paper suggested strongly that I use a publicly available corpus, and this was the result of that experiment.
I'd like to try it on the Enron corpus now, but to be honest, I'm not sure we have any really good spam and ham corpora available for testing this sort of thing. If anyone wants to sit down and help me make one, I'm interested in getting permission to collect real mail and spam from a single source (preferably a single email address, but potentially a single server).
Are you still working on this?
Not exactly. There are a lot of interesting things that I'd like to try, but never enough hours in the day, and, well, I've already got the Master's degree from my work with spam immunity. I'm currently working on my PhD, which will likely be in something more security-related... after a while, a girl can only take so many emails about Viagra before she wants to find something new to study for a living. :)
I'm hoping to find other students who might be interested in continuing the research, and I do regularly get emails from students around the world who have read my paper and want more info, so who knows?
Can I have your source code?
The source code for this is currently unreleased. I've been strongly encouraged to release it, but, again, there are only so many hours in the day and I'm a bit shy about showing that source to the world. All the algorithms are in my papers, so you could write your own with relative ease if you're feeling up to it.
Thanks for coming out! I'm really glad people find this idea interesting enough to click a link and read a paper. If you have any more questions, my email address is terri (at) zone12.com -- and yes, I realise I'm going to regret posting that, but I really do like discussing my research enough to risk the spam and trolls that may result.