Don’t call me DOM

26 April 2005

Updated spam statistics

A little more than 9 months ago, I ran some statistics on the amount of spam I receive. Given that our anti-spam setup was recently improved to reject even more buggy messages than before, I decided it was a good time to look at how things evolved over the past 6 months:

Evolution of my spam levels during the past 6 months

The blue line is the number of messages that are trashed directly when they arrive in my mailbox because their SpamAssassin score is greater than 12; the pink line is the number of messages that go into a separate mailbox that I review periodically to find false positives, which still happen from time to time. The graph doesn’t show the number of spam messages that reach my final inbox; it’s never more than one or two a day, and usually zero.
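The three destinations described above can be sketched as a small routing function. This is only an illustration, not my actual mail filter: the threshold of 12 comes from the text, but the lower threshold of 5.0 is an assumption (it is SpamAssassin's default required score), and the names `route`, `TRASH_THRESHOLD` and `SPAM_THRESHOLD` are made up for the example.

```python
from typing import Literal

TRASH_THRESHOLD = 12.0  # from the post: scores above 12 go straight to trash
SPAM_THRESHOLD = 5.0    # assumption: SpamAssassin's default required score

Folder = Literal["trash", "review", "inbox"]


def route(score: float) -> Folder:
    """Pick a destination folder from a SpamAssassin score."""
    if score > TRASH_THRESHOLD:
        return "trash"   # the blue line in the graph
    if score >= SPAM_THRESHOLD:
        return "review"  # the pink line: reviewed for false positives
    return "inbox"


if __name__ == "__main__":
    for s in (15.2, 12.0, 7.0, 1.5):
        print(s, route(s))
```

Lowering `TRASH_THRESHOLD` shrinks the review folder, at the cost of silently trashing any legitimate message that scores above the new value.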

What this graph shows is how much the number of egregious spam messages (those that SpamAssassin scores above 12) delivered to me has dropped in the past few days; the key changes in our anti-spam configuration that triggered this drop were:

  1. first, rejecting messages with many unescaped 8-bit characters in their subjects, which I think matches the first inflection in the graph around 160; as it happens, sending 8-bit characters in message headers is invalid, and has been a very good indicator of uncaring senders. Note that as good email and Web citizens, our reject bounces document why we do so, and how to get around it.
  2. secondly, the very sharp drop around 175 is a consequence of having SpamAssassin running at our MX level, and rejecting any message scoring more than 10; the benefit of running SpamAssassin higher up in our mail distribution is pretty clear: instead of having 70 SpamAssassin instances each check a message that has been sent to the 70 members of the Team, a single instance can reject it if it is really too spam-like.
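To make the first rule more concrete, here is a minimal sketch of a check for unescaped 8-bit characters in a Subject header. It is not the rule our mail servers actually run: the function names are hypothetical, and since the text only says "many", the cut-off `MAX_RAW_8BIT` is an arbitrary value chosen for illustration. The check works on the raw header bytes, because a properly encoded subject (for instance an RFC 2047 encoded-word) contains only 7-bit ASCII.

```python
MAX_RAW_8BIT = 4  # hypothetical cut-off; the real value is not given in the post


def count_raw_8bit(raw_subject: bytes) -> int:
    """Count bytes with the high bit set in a raw Subject header value."""
    return sum(1 for b in raw_subject if b >= 0x80)


def should_reject(raw_subject: bytes, limit: int = MAX_RAW_8BIT) -> bool:
    """Reject when the subject carries more raw 8-bit bytes than the limit."""
    return count_raw_8bit(raw_subject) > limit


if __name__ == "__main__":
    samples = [
        b"Meeting minutes",                  # plain ASCII
        b"=?utf-8?q?caf=C3=A9?=",            # properly encoded, still ASCII
        b"\xf2\xe5\xea\xeb\xe0\xec\xe0 !!",  # raw 8-bit bytes
    ]
    for subject in samples:
        print(subject, should_reject(subject))
```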

But why are there still messages that get discarded by my SpamAssassin with a score higher than 12, then? Because my SpamAssassin is carefully trained with SpamAssassin’s Bayesian system, and so is more accurate at finding egregious spam. But even that may soon no longer be relevant, since we’re looking into feeding the instances running on our MX with blatant spam (either through honeypots or through messages marked as spam) and well-known ham…

The only dark side to this graph is what the pink line shows: the number of messages that I have to review to detect false positives is increasing significantly. I could just give up and not review them anymore, but since I still find some false positives there, I can’t really feel confident doing so.

The other option would be to lower the go-directly-to-trash threshold, but I quickly checked my false positives mailbox (the one I use to train SpamAssassin) with: grep "X-Spam-Status:" ~/mail/ham |sed -e "s/.*=\([-0-9\.]*\) required=.*/\1/"|sort -n|uniq, and I had a few false positives with a score higher than 9, so I can’t really set it below 10…
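For readers less fluent in sed, the same extraction can be sketched in Python. This is a rough equivalent of the pipeline above, not a replacement for it: the function name `distinct_scores` is made up, and the pattern assumes the score sits on the first line of the X-Spam-Status header, right before "required=", as the sed expression does.

```python
import re

# Capture the number just before " required=" on an X-Spam-Status line,
# whatever the key in front of it is called (e.g. "hits=" or "score=").
SCORE_RE = re.compile(r"^X-Spam-Status:.*=(-?[0-9]+(?:\.[0-9]+)?) required=")


def distinct_scores(lines):
    """Return the distinct scores found in the given lines, sorted ascending."""
    scores = set()
    for line in lines:
        match = SCORE_RE.match(line)
        if match:
            scores.add(float(match.group(1)))
    return sorted(scores)


if __name__ == "__main__":
    sample = [
        "X-Spam-Status: Yes, hits=9.4 required=5.0 tests=BAYES_50",
        "Subject: hello",
        "X-Spam-Status: No, score=-1.2 required=5.0",
        "X-Spam-Status: Yes, hits=9.4 required=5.0",
    ]
    print(distinct_scores(sample))
```

The last value of the returned list is the highest score seen among known false positives, which is the number to keep below the trash threshold.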

Dominique Hazaël-Massieux (dom@w3.org) is part of the World Wide Web Consortium (W3C) Staff; his interests cover a number of Web technologies, as well as the usage of open source software in a distributed work environment.