Don’t call me DOM

6 July 2004

Spam statistics


I’ve run some very crude statistics on the amount of spam I have been receiving and filtering over the past 6 months:

  • I’m getting between 500 and 600 messages a day
  • among those, around 400 are spam
  • among them, the vast majority (~90%) is simply trashed, relying on SpamAssassin – I basically direct all messages with an SA score greater than 12 to /dev/null
  • of the remaining 10%, 90 to 95% are put in a distinct mailbox (cleverly labeled spam)
  • … which leaves me with about 3 spams in my inbox per day, which is quite manageable
  • the false positives over these 6 months amount to fewer than 50 messages, most of which wouldn’t have been a big loss if I hadn’t spotted them in my spam mailbox
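As a rough sanity check, the funnel described in the list above can be replayed with the rounded figures it quotes. This is only a sketch: the function name spam_funnel and its four parameters are made up for illustration, and the inputs are the approximate numbers from the list, not values measured from the log.

```shell
#!/bin/sh
# spam_funnel SPAM_PER_DAY TRASH_PCT MAILBOX_PCT_LOW MAILBOX_PCT_HIGH
# Hypothetical helper: replays the filtering funnel using whole-number
# percentages. The inputs are the rounded figures quoted in the post.
spam_funnel() {
    awk -v spam="$1" -v trash="$2" -v lo="$3" -v hi="$4" 'BEGIN {
        trashed   = spam * trash / 100           # sent straight to /dev/null
        remaining = spam - trashed               # scored 12 or less by SA
        inbox_max = remaining * (100 - lo) / 100 # if only lo% reach the spam mailbox
        inbox_min = remaining * (100 - hi) / 100 # if hi% reach the spam mailbox
        printf "trashed=%d remaining=%d inbox=%d-%d\n",
               trashed, remaining, inbox_min, inbox_max
    }'
}

# Around 400 spams a day, ~90% trashed, 90 to 95% of the rest filed as spam
spam_funnel 400 90 90 95
```

With these inputs, between 2 and 4 spams a day slip through to the inbox, which is consistent with the "about 3" reported above.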

I’ve gathered these statistics from my procmail log, using a simple grep, à la grep -A 2 "Mon Jul 5" .procmail/log |grep "/dev/null"|wc -l (this gives the number of messages directed to /dev/null on Monday, July 5th).

With a simple loop, I can get data across the past months in a comma-separated format:

for i in `seq 190 -1 1` ; do day=`LC_ALL="en" date --d "$i days ago"|cut -b -11`; echo $day , `grep -A2 "$day" .procmail/log|grep "/dev/null"|wc -l` , `grep -A2 "$day" .procmail/log|grep "spam"|wc -l`; done > spam-evolution.csv

Once the file is loaded in Gnumeric (the Gnome equivalent of Excel), I can get graphs of how my spam is split over time between /dev/null and my spam mailbox:
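The loop above rereads the whole log twice for each of the 190 days. The same kind of table could probably be built in a single pass with awk; the sketch below is one possible way, not the method used for the graphs. It assumes the usual procmail log layout (a "From sender weekday month day time year" line, followed by " Subject:" and "  Folder:" lines), and the function name count_by_day is made up for the example. Unlike the grep version, it only looks at the Folder line when matching "spam", and it only lists days that actually appear in the log.

```shell
#!/bin/sh
# count_by_day LOGFILE
# Hypothetical helper: prints "day,trashed,spam" for each day found in a
# procmail log, in order of first appearance.
count_by_day() {
    awk '
        /^From / && NF >= 6 {
            # The date is taken from the end of the line:
            # ... weekday month day hh:mm:ss year
            day = $(NF-4) " " $(NF-3) " " $(NF-2)
            if (!(day in seen)) { seen[day] = 1; order[++n] = day }
            next
        }
        /^ *Folder: / && day != "" {
            if ($2 == "/dev/null")  trashed[day]++   # discarded outright
            else if ($2 ~ /spam/)   spam[day]++      # filed in the spam mailbox
        }
        END {
            for (i = 1; i <= n; i++) {
                d = order[i]
                printf "%s,%d,%d\n", d, trashed[d], spam[d]
            }
        }
    ' "$1"
}
```

It would be called as count_by_day .procmail/log > spam-evolution.csv, with the path being the one used in the commands above.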

The one below (time on horizontal axis, number of messages on vertical) shows that the number of spams has steadily grown over the past months:
Distribution of my spam between trash and spam mailbox, received between start of January and end of June 2004
It would be interesting to see whether the peaks here and there match some of the spam storms W3C has encountered this year.

The following graph (percentage of spams directly trashed vs those in my spam mailbox over time) seems to indicate either a slow gain in SpamAssassin’s efficiency at trashing spams, or a rise in the rate of obvious spams.
Evolution of the ratio between my trashed spam and the spam that wasn't directly trashed
I’m tempted to think it’s the former, given that spams tend to get “smarter” at hiding themselves, and since there is a rational explanation for SpamAssassin getting better – I’m using its Bayesian training functions. Also worth noting is that the variation of this percentage seems to diminish over time, although I have no idea how this should be interpreted!
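The percentage plotted in that second graph can also be computed outside the spreadsheet. The following is a minimal sketch, assuming a file shaped like spam-evolution.csv (day, trashed count, spam-mailbox count, separated by commas with optional spaces); the function name trash_share is made up for the example, and days with no spam at all are skipped to avoid dividing by zero.

```shell
#!/bin/sh
# trash_share CSVFILE
# Hypothetical helper: prints "day,percentage" where percentage is the share
# of that day's filtered spam that went straight to /dev/null.
trash_share() {
    awk -F ',' '
        {
            day = $1
            gsub(/^[ \t]+|[ \t]+$/, "", day)   # trim spaces around the date
            trashed = $2 + 0                   # "+ 0" forces a numeric value
            kept    = $3 + 0
            total   = trashed + kept
            if (total > 0)
                printf "%s,%.1f\n", day, 100 * trashed / total
        }
    ' "$1"
}
```

It would be called as trash_share spam-evolution.csv, and the resulting column is the series whose slow rise is discussed above.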

Dominique Hazaël-Massieux (dom@w3.org) is part of the World Wide Web Consortium (W3C) Staff; his interests cover a number of Web technologies, as well as the usage of open source software in a distributed work environment.