A patch for SpamBayes to record URLs' IPs and a cache for PyDNS


Creating synthetic tokens for the IP addresses that a URL's host part resolves to seems to increase SpamBayes's accuracy on certain sorts of spam. If you do that and you're doing lots of scoring, a DNS cache is useful.

Non-geeks and geeks not interested in the details of Bayesian spam filtering may prefer to skip this post.

A while ago I mentioned the spam filter SpamBayes. I've used it almost from the beginning and it works very well for me.

Starting early this year, I found that spammers had begun sending messages with bland or almost entirely nonsense text and a link to click on. SpamBayes would generally score them as unsure because they contained so little information that it could make use of. (In his original article, Paul Graham predicted that spammers would respond that way to the widespread use of Bayesian filters.)

Turning on SpamBayes's mine_received_headers option helps, but in my experience not enough, especially if you have any legitimate correspondents on Comcast's network.

In April, I posted a patch for SpamBayes's tokenizer to the spambayes-dev list. It creates synthetic tokens for the IP addresses that the host parts of the URLs in a message resolve to. That turns out to help a lot on those messages, because the IPs of spammers' webservers aren't uniformly distributed. Indeed, it seems that relatively few networks are willing to host spammers' websites, and it doesn't take very long for SpamBayes to start using those tokens as evidence.
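For anyone who wants to see the idea in code, here is a rough sketch. It is not the patch itself. The function names, the "url-ip:" token format, the "url-ip:error" token and the choice to emit one token per network prefix are all assumptions of mine, made up for illustration. It uses the system resolver from the standard library rather than PyDNS.

```python
import re
import socket
from urllib.parse import urlsplit

# Deliberately crude URL matcher; a real tokenizer would be more careful.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def resolve_ipv4(host):
    """Return the IPv4 addresses for host using the system resolver."""
    return socket.gethostbyname_ex(host)[2]


def url_ip_tokens(text, resolve=resolve_ipv4):
    """Yield synthetic tokens for the IPs behind each URL host in text.

    The token spellings are hypothetical. Each address produces one
    token per prefix length (/8, /16, /24, /32) so that a classifier
    can learn about whole networks as well as single hosts.
    """
    seen_hosts = set()
    for url in URL_RE.findall(text):
        host = urlsplit(url).hostname
        if not host or host in seen_hosts:
            continue
        seen_hosts.add(host)
        try:
            addresses = resolve(host)
        except OSError:
            # Covers socket.gaierror and socket.herror: the lookup failed.
            yield "url-ip:error"
            continue
        for ip in addresses:
            octets = ip.split(".")
            for n in (1, 2, 3, 4):
                yield "url-ip:%s/%d" % (".".join(octets[:n]), 8 * n)
```

The resolve argument is there so you can swap in a cached or fake resolver without touching the tokenizing logic.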

At first, it didn't seem to help. At least not much. It even produced a small decrease in accuracy in some cases. But that was on historical data. On more recent data, it's a significant win for me.

If you're doing a lot of scoring all at once (as you might with certain training regimes), doing lookups that way generates a lot of DNS traffic. Unless your resolving DNS server is electronically very close to you (like on the same Ethernet segment), that's going to slow scoring down a fair amount. Depending on the details of your situation, it may also be a significant load on your (or your ISP's) DNS server. To deal with that, I've hacked up a cache for PyDNS and a slightly different version of the patch. (With the new version, the clue "timeout" is now a slight misnomer. A better name would be something more generic like "error", but I've left it as it was for compatibility with the data in my database.)

By default, the cache respects the time-to-live of the data returned by the resolving name server it uses. The resolving name server component of D.J. Bernstein's djbdns returns TTLs of zero under most circumstances. Probably some others do too. Dan explains that that's to prevent cache snooping. If the cache doesn't seem to speed scoring up, you can set its attribute printStatsAtEnd to see if you're getting any cache hits. If low TTLs turn out to be the problem, you can set the attribute minTTL to 300 or 600 seconds or something harmlessly small like that and the cache will cache everything for at least that long.
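To make the TTL behaviour concrete, here is a minimal sketch of a TTL-respecting cache with a minimum-TTL floor. It is not the PyDNS cache described above. The class name, the lookup callback (assumed to return a pair of addresses and TTL in seconds) and the hits and misses counters are all hypothetical; only the name minTTL is borrowed from the real thing.

```python
import time


class TTLCache:
    """Cache lookup results for max(ttl, minTTL) seconds.

    lookup is a hypothetical callable taking a host name and returning
    (addresses, ttl_in_seconds). clock is injectable to ease testing.
    """

    def __init__(self, lookup, minTTL=0, clock=time.monotonic):
        self.lookup = lookup
        self.minTTL = minTTL
        self.clock = clock
        self.entries = {}  # host -> (expiry_time, addresses)
        self.hits = 0
        self.misses = 0

    def resolve(self, host):
        now = self.clock()
        entry = self.entries.get(host)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Missing or expired: ask the real resolver and store the answer.
        self.misses += 1
        addresses, ttl = self.lookup(host)
        self.entries[host] = (now + max(ttl, self.minTTL), addresses)
        return addresses
```

With minTTL left at 0, a resolver that always reports a TTL of zero makes every entry expire immediately, so the hit counter stays at zero. Setting minTTL to 300 keeps each answer for at least five minutes regardless of what the resolver says.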

Posted: Sun - June 27, 2004 at 08:46