<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Don't call me DOM &#187; spam</title>
	<atom:link href="http://people.w3.org/~dom/archives/category/email/spam/feed/" rel="self" type="application/rss+xml" />
	<link>http://people.w3.org/~dom</link>
	<description>W3C has the DOM, and the Dom ; pick the one you prefer.</description>
	<lastBuildDate>Sat, 07 Nov 2009 11:02:42 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.5</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>The beauty of HTMLMediaElement</title>
		<link>http://people.w3.org/~dom/archives/2009/02/the-beauty-of-htmlmediaelement/</link>
		<comments>http://people.w3.org/~dom/archives/2009/02/the-beauty-of-htmlmediaelement/#comments</comments>
		<pubDate>Fri, 13 Feb 2009 13:25:43 +0000</pubDate>
		<dc:creator>Dom</dc:creator>
				<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Work environment]]></category>
		<category><![CDATA[spam]]></category>

		<guid isPermaLink="false">http://people.w3.org/~dom/?p=247</guid>
		<description><![CDATA[So, while exploring the world of Web video, after having successfully transcribed a one-hour-long video of one of my presentations,
and turned that transcription into an HTML 5 video with subtitles, I started to look in more detail at what HTML 5 brought to the table that made this synchronization possible.
The rather obvious change [...]]]></description>
			<content:encoded><![CDATA[<p>So, while <a href="http://people.w3.org/~dom/archives/2009/02/exploring-the-world-of-web-video/">exploring the world of Web video</a>, after having successfully <a href="http://people.w3.org/~dom/archives/2009/02/diving-in-transcription/">transcribed a one-hour-long video of one of my presentations</a>,
and turned that transcription into <a href="http://people.w3.org/~dom/archives/2009/02/synchronizing-text-and-video/">an HTML 5 video with subtitles</a>, I started to look in more detail at what HTML 5 brought to the table that made this synchronization possible.</p>
<p>The rather obvious change that HTML 5 brings to the table is the <a href="http://dev.w3.org/html5/spec/Overview.html#htmlmediaelement"><code>HTMLMediaElement</code></a> DOM interface, and in particular the <code>currentTime</code> property, which at any time reflects the current playback position in the media content.</p>
<p>This means that you can synchronize any part of your HTML page with the video, as well as navigate through the video by setting the property to the start of the section you want to play!</p>
<p>And since I had already gathered a lot of timing information in the transcript of the video, extracting meaningful timings of the various sequences of the video was again only an XSLT style sheet away, provided I added relevant metadata in the transcription: typically, identifying subsections as <code>&lt;div&gt;</code> in the timedtext transcript, with a <code>ttm:title</code> set (which I achieved directly through my transcribing tool, Transcriber, that has all the needed interfaces to set these metadata).</p>
<p>And so I wrote that <a href="http://www.w3.org/2009/02/presentation-viewer/create-viewer-from-dfxp.xsl">XSLT</a>, added some further out-of-band metadata linking to <a href="http://www.w3.org/2009/02/presentation-viewer/parisweb-topic-slides.xml">slides</a> and additional <a href="http://www.w3.org/2009/02/presentation-viewer/parisweb-topic-notes.xml">notes</a> that I wanted to include in my presentation viewer (<a href="http://www.w3.org/2009/02/presentation-viewer/create-viewer-from-dfxp.xsl">more details on the process involved are available</a>).</p>
<p>The fact that I couldn&#8217;t embed these additional data in TimedText is actually quite disappointing &#8211; that a Web format should be developed without any way to add hyperlinks seems quite wrong! Generally speaking, it&#8217;s not clear to me that timedtext should be anything other than a set of additional timing attributes on top of XHTML &#8211; but I can&#8217;t claim that I have explored that space sufficiently to give much credit to that assertion.</p>
<p>Given that these metadata were not stored in the TimedText file, I ended up having them embedded in the resulting HTML page; it occurred to me that the best way to store them there was to use the <a href="http://www.w3.org/2008/WebVideo/Fragments/wiki/Syntax">extremely experimental media fragment syntax</a> within an <a href="http://www.w3.org/TR/rdfa-syntax/">RDFa description of the table of contents</a>, e.g.:</p>
<pre><code>&lt;ul class="toc">
            &lt;li about="http://media.w3.org/2007/11/parisweb-dom.ogv#t=00:00:44.209,00:01:28.432">
               &lt;a target="slides" rel="foaf:depiction"  
                    property="dc:title" 
                    href="http://www.w3.org/2007/Talks/11-parisweb/slide-1.html">
                    Introduction
                &lt;/a>
            &lt;/li>
&lt;/ul></code></pre>
<p>This essentially annotates a given section of the video (<code>#t=00:00:44.209,00:01:28.432</code> meaning between 44.209 seconds after the start of the video and 1 minute 28.432 seconds after the start) with a title and an illustration (in this case, the accompanying slide) &#8211; I chose <code>foaf:depiction</code> as a property, but it probably isn&#8217;t the best match &#8211; I&#8217;m hoping that the <a href="http://www.w3.org/2008/01/media-annotations-wg.html">Media Annotations Working Group</a> will come up with a useful ontology that could be used in these types of contexts.</p>
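<p>As a rough illustration of how such a temporal fragment maps to plain seconds (the kind of value <code>currentTime</code> works with), here is a minimal Python sketch. It is not part of the viewer; the helper names <code>parse_clock</code> and <code>parse_time_fragment</code> are made up for this example, and it only handles the <code>hh:mm:ss.mmm</code> form of the experimental syntax shown above:</p>
<pre><code>def parse_clock(value):
    """Convert an hh:mm:ss.mmm clock value into seconds."""
    hours, minutes, seconds = value.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_time_fragment(uri):
    """Return (begin, end) in seconds from a '#t=begin,end' fragment."""
    _, _, fragment = uri.partition("#")
    # Only the temporal 't=' form used in this post is handled
    if not fragment.startswith("t="):
        raise ValueError("no temporal fragment in " + uri)
    begin, end = fragment[2:].split(",")
    return parse_clock(begin), parse_clock(end)


begin, end = parse_time_fragment(
    "http://media.w3.org/2007/11/parisweb-dom.ogv#t=00:00:44.209,00:01:28.432"
)
print(begin, end)
</code></pre>
<p>The two values could then be compared against the playback position to decide when the annotated section starts and ends.</p>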

<p>These annotations are then parsed by a small <a href="http://www.w3.org/2009/02/presentation-viewer/sync.js">JavaScript layer</a> (built on top of jQuery) which reproduces most of what the <a href="http://www.w3.org/2008/12/dfxp-testsuite/web-framework/HTML5_player.js">TimedText JavaScript player</a> does, but in a much less verbose way&hellip; &#8211; another reason for wishing that timedtext was really just XHTML.</p>

<p>The <a href="http://www.w3.org/2009/02/presentation-viewer/parisweb2007-dom.html">resulting presentation viewer</a> lets you navigate through the video, with synchronized slides, notes and subtitles, provided your browser supports the <code>HTMLMediaElement</code> interface, as Firefox 3.1 does:</p>
<object  height="347" width="420" type="application/x-shockwave-flash" name="mpl" data="http://dotsub.com/static/players/portalplayer.swf">

<param name="swliveconnect" value="true"  />

<param name="allowFullScreen" value="true" />

<param name="allowScriptAccess" value="always" />

<param name="flashvars" value="mediauri=/media/ed3cbe9c-07d1-45fc-bd76-2d0d58870e0e/m/flv/en&amp;screenshoturi=http://dotsub.com/media/ed3cbe9c-07d1-45fc-bd76-2d0d58870e0e/p&amp;mediaDuration=38000&amp;lang=eng "/>

<object height="347" width="420" type="video/x-flv" data="http://dotsub.com/media/ed3cbe9c-07d1-45fc-bd76-2d0d58870e0e/m/flv/e" ></object>

</object>
<p>(also available as <a href="http://media.w3.org/2009/02/presentation-viewer-screencast.ogv">Ogg/Theora video</a> with a <a href="http://media.w3.org/2009/02/presentation-viewer-screencast.xml">Timed Text transcript</a>.)</p>
<p>It also carries a set of <a href="http://www.w3.org/2007/08/pyRdfa/extract?uri=http%3A%2F%2Fwww.w3.org%2F2009%2F02%2Fpresentation-viewer%2Fparisweb2007-dom.html&amp;format=pretty-xml&amp;warnings=false&amp;parser=lax&amp;host=xhtml&amp;space-preserve=true&amp;submit=Go%21">RDF annotations to the video itself</a>.</p>
<p>(I discovered only today that apparently <a href="http://blog.gingertech.net/2008/09/27/demo-of-new-html5-features/">Ian Hickson made a very similar demonstration</a> a few months ago)</p>
<p>I must confess that I&#8217;m not quite sure that the accessibility of the resulting page is great &#8211; it uses the &lt;object&gt; element to load external pages (the slides and notes), while it should probably include their content automatically through AJAX, with a pinch of <a href="http://www.w3.org/WAI/intro/aria">WAI ARIA</a> to announce page updates.</p>
<p>The AJAX inclusion of content  would be much facilitated by <a href="http://www.w3.org/TR/css-style-attr">scoped style sheets</a>.</p>

<p>My conclusion from this exploration is that clearly the new <code>HTMLMediaElement</code> DOM interface is of great importance to really bring video (and similarly audio) into the Web browser; I can see how it could be improved to make it much easier to create synchronization effects:</p>
<ul>
<li><del>using some sort of timer callback interface with a begin and end period &#8211; currently, you have to use the generic <code>setInterval</code> function that polls every tenth of a second to check whether you are in a period of the video where something should happen; ideally, you would just say <code>video.setTimer(<var>callback_function</var>,<var>begin</var>,<var>end</var>)</code>, and <code><var>callback_function</var></code> would be called each time the video enters the period of time between <code>begin</code> and <code>end</code>;</del><ins>Err&#8230; it seems that&#8217;s exactly what <a href="http://www.w3.org/TR/2009/WD-html5-20090212/l#dom-media-addcuerange"><code>addCueRange</code></a> is about&hellip; I guess what I really need is <a href="https://developer.mozilla.org/En/NsIDOMHTMLMediaElement#addCueRange()">having it implemented</a> :)</ins></li>
<li>ensuring that the <code>HTMLMediaElement</code> interface is applied to any element where a time-based animation is used: being able to use it on SVG animations, Flash animations, and maybe even animated GIFs (!) sounds as useful as on video and audio; maybe this &#8220;just&#8221; means that the <code>&lt;video&gt;</code> element implementations should support <code>image/svg+xml</code> and <code>image/gif</code> as acceptable media types?</li>
<li>it seems really backward that any JavaScript layer be required at all to run synchronized subtitles, and the <code>&lt;video&gt;</code> element should clearly support linking media content and their transcript in a uniform way;</li>
<li>it would be really neat if the media fragment URIs could be used directly to go to a particular section of a video included in the page, without the JavaScript layer.</li>
</ul>]]></content:encoded>
			<wfw:commentRss>http://people.w3.org/~dom/archives/2009/02/the-beauty-of-htmlmediaelement/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
<enclosure url="http://media.w3.org/2007/11/parisweb-dom.ogv#t=00:00:44.209" length="85117168" type="video/ogg" />
<enclosure url="http://dotsub.com/media/ed3cbe9c-07d1-45fc-bd76-2d0d58870e0e/m/flv/e" length="2433090" type="video/x-flv" />
<enclosure url="http://media.w3.org/2009/02/presentation-viewer-screencast.ogv" length="688128" type="video/ogg" />
		</item>
		<item>
		<title>Small SURBL Python library</title>
		<link>http://people.w3.org/~dom/archives/2006/04/small-surbl-python-library/</link>
		<comments>http://people.w3.org/~dom/archives/2006/04/small-surbl-python-library/#comments</comments>
		<pubDate>Tue, 18 Apr 2006 11:59:25 +0000</pubDate>
		<dc:creator>Dom</dc:creator>
				<category><![CDATA[Web development]]></category>
		<category><![CDATA[spam]]></category>

		<guid isPermaLink="false">http://people.w3.org/~dom/archives/2006/04/small-surbl-python-library/</guid>
		<description><![CDATA[The spammers have struck again, and we received reports that one of our extremely useful public services was being used by spammers to work around URL matching techniques. In other words, a spammer who would have been identified (in email messages, blog comments) as using http://example.net/ as a URI in his spam could work around it by [...]]]></description>
			<content:encoded><![CDATA[<p>The spammers have struck again, and we received reports that one of our extremely useful public services was being used by spammers to work around URL matching techniques. In other words, a spammer who would have been identified (in email messages, blog comments) as using <code>http://example.net/</code> as a URI in his spam could work around it by putting a link to <code>http://our-useful-service.example.org?uri=http://example.net/</code> instead, and given that the said service more or less entirely preserves the content as is, this did indeed make it possible to link to the offending content.</p>
<p>The reporter of this abuse of service had the good idea to mention an existing technical solution to this type of problem: <a href="http://www.surbl.org/">SURBL</a> is a registry of registered domain names that have been reported as used by spammers. Although I&#8217;m not a big fan of this type of registry (they really seem the lowest type of trust network one can imagine, too easily abused), faced with the alternative of shutting down the service entirely or reducing the possibility of abusing it, I took the second option.</p>
<p>As the said service (that I&#8217;m not mentioning explicitly in case it would draw more attention than it needs at this point) is written in Python, I&#8217;ve been looking for a Python implementation of the SURBL mechanism; unfortunately, I haven&#8217;t been able to find one and so had to write my <a href="http://dev.w3.org/cvsweb/~checkout~/2006/surbl.py?rev=1.2&amp;content-type=text/plain">own implementation</a>, which seems to work well (unit tests also helped) and now supports the alluded on-line service.</p>
<p>SURBL uses DNS queries as a way to query their registry: you ask whether <code>spam.example.multi.surbl.org</code> exists; if it does, then <code>spam.example</code> is part of the blacklist, otherwise it isn&#8217;t. In terms of implementation, the only (rather small) difficulty is to identify the relevant part of the URI you want to check, namely the registered name used in the authority component of the URI. This implies removing the possible port and user information parts in the authority component, but also the possible sub-domains of a registered domain; this would be entirely trivial if one didn&#8217;t have to take into account delegated second-level domain names (such as <code>co.uk</code>).</p>
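<p>The extraction step described above could be sketched roughly as follows. This is an illustration only, not the code of the <code>surbl.py</code> linked above: the function names and the tiny <code>TWO_LEVEL_SUFFIXES</code> set are assumptions made for the example, a real checker needs a far more complete suffix list, and IP-address hosts are not handled. The actual DNS lookup is left out so that the sketch stays self-contained.</p>
<pre><code>from urllib.parse import urlsplit

# Hypothetical, deliberately tiny list of delegated second-level domains;
# a real checker would need a much more complete one.
TWO_LEVEL_SUFFIXES = {"co.uk", "org.uk", "com.au", "co.jp"}


def registered_domain(uri):
    """Return the registered domain name used in the URI's authority."""
    # urlsplit().hostname drops the user information and the port
    host = urlsplit(uri).hostname
    if host is None:
        raise ValueError("no authority component in " + uri)
    labels = host.split(".")
    # Keep three labels under a delegated suffix such as co.uk, else two
    keep = 3 if ".".join(labels[-2:]) in TWO_LEVEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


def surbl_query_name(uri, zone="multi.surbl.org"):
    """Build the DNS name whose existence means the domain is listed."""
    return registered_domain(uri) + "." + zone


print(surbl_query_name("http://user@www.spam.example:8080/page"))
print(surbl_query_name("http://shop.example.co.uk/"))
</code></pre>
<p>The resulting name would then be resolved with a DNS library; if the lookup succeeds, the domain is on the list.</p>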
<p>SURBL&#8217;s architecture is a rather smart way to re-use the existing caching/query infrastructure deployed for DNS (e.g. I just had to import <a href="http://dnspython.org">DNS Python</a> to query their blacklist). I suspect that the software infrastructure behind DNS implements caching more widely than most implementations of HTTP do, and thus would induce a smaller load on their servers &#8211; although I haven&#8217;t quite checked it. Of course, the DNS system is particularly well suited to this, given that the queried items follow (by definition) the naming rules of DNS. Also, I guess that the primary usage of the system (mail filtering) made DNS a prerequisite, while access to an HTTP client may be less obvious. But I wonder if there would be any compelling reasons to make this also available as an HTTP service?</p>
]]></content:encoded>
			<wfw:commentRss>http://people.w3.org/~dom/archives/2006/04/small-surbl-python-library/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Updated spams statistics</title>
		<link>http://people.w3.org/~dom/archives/2005/04/updated-spams-statistics/</link>
		<comments>http://people.w3.org/~dom/archives/2005/04/updated-spams-statistics/#comments</comments>
		<pubDate>Tue, 26 Apr 2005 14:48:09 +0000</pubDate>
		<dc:creator>Dom</dc:creator>
				<category><![CDATA[spam]]></category>

		<guid isPermaLink="false">http://people.w3.org/~dom/archives/2005/04/updated-spams-statistics/</guid>
		<description><![CDATA[A little more than 9 months ago, I ran some statistics on the rate of spam I receive, and given that our anti-spam setup was recently improved to reject even more buggy messages than before, I decided it was a good time to see what the evolution over the past 6 months was:

The blue [...]]]></description>
			<content:encoded><![CDATA[<p>A little more than 9 months ago, I <a href="http://people.w3.org/~dom/archives/2004/07/spam-statistics/">ran some statistics on the rate of spam I receive</a>, and given that our anti-spam setup was recently improved to reject even more buggy messages than before, I decided it was a good time to see what the evolution over the past 6 months was:</p>
<p><img src='http://people.w3.org/~dom/wp-content/spam-evolution-20050426.png' alt='Evolution of my spam levels during the past 6 months' /></p>
<p>The blue line is the number of messages that are directly trashed when arriving in my mailbox because their SpamAssassin score is greater than 12; the pink line is the number of messages that go into a separate mailbox that I review periodically to find false positives, which still happen from time to time. The graph doesn&#8217;t show the number of spam messages that I get in my final inbox; it&#8217;s never more than one or two a day, usually zero.</p>
<p>What this graph shows is how much the number of egregious spams (those that SpamAssassin scores above 12) that are distributed to me has dropped in the past few days; the key changes in our anti-spam configuration that triggered that change were:</p>
<ol>
<li>first, rejecting messages with many unescaped 8-bit characters in their subjects, which I think matches the first inflection in the graph around 160; as it happens, sending 8-bit characters in message headers is invalid, and has been a very good indicator of uncaring senders. Note that as good email and Web citizens, our reject bounces document <a href="http://www.w3.org/Mail/unencoded-8bits.html">why we do so, and how to get around it</a>.</li>
<li>secondly, the very sharp drop around 175 is a consequence of having SpamAssassin running at our MX level, and rejecting any message scoring more than 10; the benefit of having SpamAssassin running higher up in our mail distribution is pretty clear: instead of having 70 SpamAssassin instances check a message that has been sent to the 70 members of the Team, a single instance can reject it if it is really too spam-like.</li>
</ol>
<p>But why are there still messages that get discarded by my SpamAssassin with a score higher than 12, then? Because my SpamAssassin is carefully trained with <abbr title="SpamAssassin">SA</abbr>&#8217;s Bayesian system, and so is more accurate at finding egregious spams. But even that may soon be no longer relevant, since we&#8217;re looking into feeding the instances running on our MX with blatant spams (either through honeypots or through <a href="http://people.w3.org/~dom/archives/2004/09/annospam/">messages marked as spams</a>) and well-known ham&#8230;</p>
<p>The only dark side to this graph is what the pink line shows: the number of messages that I have to review to detect false positives is significantly increasing. I could just give up and not review them anymore, but since I still get some false positives, I can&#8217;t really feel confident about this.</p>
<p>The other option would be to lower the <q>go directly to trash</q> threshold, but I quickly checked on my false positives mailbox (that I use to train SpamAssassin) with: <code>grep "X-Spam-Status:" ~/mail/ham |sed -e "s/.*=\([-0-9\.]*\) required=.*/\1/"|sort -n|uniq</code>, and I had a few messages with a score higher than 9 as false positives, so I can&#8217;t really set it below 10&#8230;</p>]]></content:encoded>
			<wfw:commentRss>http://people.w3.org/~dom/archives/2005/04/updated-spams-statistics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bulk-delete comments in wordpress</title>
		<link>http://people.w3.org/~dom/archives/2004/11/bulk-delete-comments-in-wordpress/</link>
		<comments>http://people.w3.org/~dom/archives/2004/11/bulk-delete-comments-in-wordpress/#comments</comments>
		<pubDate>Tue, 02 Nov 2004 11:43:13 +0000</pubDate>
		<dc:creator>Dom</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[spam]]></category>

		<guid isPermaLink="false">http://people.w3.org/~dom/archives/2004/11/bulk-delete-comments-in-wordpress/</guid>
		<description><![CDATA[Like many others, I&#8217;ve recently been receiving plenty of spammy comments on my blog; all of these are queued for moderation by default, but even that gets a bit painful when flooded with moderation messages in my inbox, and having to delete each message individually. Even the bulk delete command in WordPress doesn&#8217;t allow you [...]]]></description>
			<content:encoded><![CDATA[<p>Like many others, I&#8217;ve recently been receiving plenty of spammy comments on my blog; all of these are queued for moderation by default, but even that gets a bit painful when flooded with moderation messages in my inbox, and having to delete each message individually. Even the bulk delete command in WordPress doesn&#8217;t allow you to delete all (or most of) the comments at once.</p>
<p>I got around the moderation flood in my inbox using yet another procmail rule, and got around the second part using a <a href="javascript:%28function%28%29%7Bfields%3Ddocument.getElementsByTagName%28%27input%27%29%3Bfor%20%28i%3D0%3Bi%3Cfields.length%3Bi%2B%2B%29%20%7B%20if%20%28fields%5Bi%5D.value%3D%3D%27delete%27%29%20%7B%20fields%5Bi%5D.checked%3D1%20%7D%20%7D%20%7D%29%28%29">small bookmarklet</a>, which runs the following javascript code:</p>
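<pre><code>// Collect every input element on the moderation page
var fields = document.getElementsByTagName('input');
for (var i = 0; i &lt; fields.length; i++) {
  // Tick the inputs whose value is 'delete'
  if (fields[i].value == 'delete') {
    fields[i].checked = true;
  }
}</code></pre>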
<p>In other words, find all the <code>input</code> elements, and if they have a <code>value</code> set to <code>delete</code> (the current convention in my WordPress version), mark them as <code>checked</code>; you then only need to push the moderate button. Of course, this works well only if you receive mostly spam (as in my case) compared to real comments.</p>]]></content:encoded>
			<wfw:commentRss>http://people.w3.org/~dom/archives/2004/11/bulk-delete-comments-in-wordpress/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Annospam</title>
		<link>http://people.w3.org/~dom/archives/2004/09/annospam/</link>
		<comments>http://people.w3.org/~dom/archives/2004/09/annospam/#comments</comments>
		<pubDate>Fri, 24 Sep 2004 13:16:57 +0000</pubDate>
		<dc:creator>Dom</dc:creator>
				<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Web development]]></category>
		<category><![CDATA[spam]]></category>

		<guid isPermaLink="false">http://people.w3.org/~dom/archives/2004/09/annospam/</guid>
		<description><![CDATA[I have been busy lately deploying a tool, informally called Annospam, that I (and others) started to develop a year ago and that had been stalled since then; the tool makes it possible to cleanse the W3C Mailing List Archives of the huge amount of spam they host and are likely to continue to receive, however clever our [...]]]></description>
			<content:encoded><![CDATA[<p>I have been busy lately deploying a tool, informally called Annospam, that I (and others) started to develop a year ago and that had been stalled since then; the tool makes it possible to cleanse the W3C Mailing List Archives of the huge amount of spam they host and are likely to continue to receive, however clever our anti-spam systems are getting.</p>
<p>The idea is to use the <a href="http://www.w3.org/2002/12/AnnoteaProtocol-20021219">Annotea protocol</a> as a way to store and retrieve spam marks on archived messages, and to regenerate the relevant archives based on these marks; it uses lots of W3C Technologies (<a href="http://www.w3.org/2003/08/kill-spam">XSLT as a way</a> to build a user interface, RDF/XML as a data format, HTTP as a query/update protocol), which makes it really interesting, if sometimes somewhat challenging.</p>
<p>I hope to get around to finishing proper documentation for it soon enough, but if you look well enough, you should already be able to see some mailing list archives being cleaned through this very system&#8230; (hint: to detect a cleaned mailing list, find those where the number of messages displayed in the cover page doesn&#8217;t match the one displayed in a period-page; and yes, this is a bug :) )</p>]]></content:encoded>
			<wfw:commentRss>http://people.w3.org/~dom/archives/2004/09/annospam/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Give Spammers a rest!</title>
		<link>http://people.w3.org/~dom/archives/2004/09/give-spammers-a-rest/</link>
		<comments>http://people.w3.org/~dom/archives/2004/09/give-spammers-a-rest/#comments</comments>
		<pubDate>Mon, 13 Sep 2004 13:11:10 +0000</pubDate>
		<dc:creator>Dom</dc:creator>
				<category><![CDATA[spam]]></category>

		<guid isPermaLink="false">http://people.w3.org/~dom/archives/2004/09/give-spammers-a-rest/</guid>
		<description><![CDATA[Spammers, like many people down here, need to rest after all their efforts; spammers need to take a week-end break, too, as shown by the distribution of the number of messages per weekday:
 (1 is Monday, 7 Sunday) These plots are based on the spam I received in the past 2 months.
Note that in fact, this [...]]]></description>
			<content:encoded><![CDATA[<p>Spammers, like many people down here, need to rest after all their efforts; spammers need to take a week-end break, too, as shown by the distribution of the number of messages per weekday:</p>
<p><img src="http://people.w3.org/~dom/wp-content/spam-per-day.png" alt="Statistics of received messages per weekday" /> (1 is Monday, 7 Sunday) These plots are based on the spam I received in the past 2 months.</p>
<p>Note that in fact, this interpretation is probably buggy; for instance, it&#8217;s likely that a fair number of the zombie computers used to send spam are shut down during the week-end.</p>]]></content:encoded>
			<wfw:commentRss>http://people.w3.org/~dom/archives/2004/09/give-spammers-a-rest/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Fake SpamAssassin headers</title>
		<link>http://people.w3.org/~dom/archives/2004/07/fake-spamassassin-headers/</link>
		<comments>http://people.w3.org/~dom/archives/2004/07/fake-spamassassin-headers/#comments</comments>
		<pubDate>Wed, 28 Jul 2004 22:30:23 +0000</pubDate>
		<dc:creator>Dom</dc:creator>
				<category><![CDATA[spam]]></category>

		<guid isPermaLink="false">http://people.w3.org/~dom/archives/2004/07/fake-spamassassin-headers/</guid>
		<description><![CDATA[Although my anti-spam set up works fairly well, I had been surprised in the past months (apparently starting end of May) to get some obvious spams (involving e.g. &#8216;Valium&#8217; in the subject) going through it without problems. Only today have I realized that this was because the mails were not checked by my SpamAssassin, but [...]]]></description>
			<content:encoded><![CDATA[<p>Although my <a href="http://people.w3.org/~dom/archives/2004/07/spam-statistics/">anti-spam setup works fairly well</a>, I had been surprised in the past months (apparently starting end of May) to get some obvious spams (involving e.g. &#8216;Valium&#8217; in the subject) going through it without problems. Only today have I realized that this was because the mails were not checked by <em>my</em> SpamAssassin, but (supposedly) by a SpamAssassin on popular free Web-based email services (e.g. Yahoo or Hotmail); that is, they included the following headers:</p>
<pre><code>X-Spam-Checker-Version: SpamAssassin 2.60-spambr_20030926a on <var>popular_mail_service</var>.com
X-Spam-Level:
X-Spam-Status: No, hits=-5.9 required=5.0 tests=AWL,NO_REAL_NAME autolearn=no
        version=2.60-spambr_20030926a</code></pre> 
<p>Due to the way my SpamAssassin setup works, they were not re-checked when entering my spam filters!</p>
<p>Although this should probably be fixed at a higher level in our mail distribution system, I&#8217;ve worked around it with the following procmail rule:</p>
<pre><code># clean spurious SA headers
:0fw
* X-Spam-Checker-Version: SpamAssassin 2&#x5C;.60-spambr_20030926a on
| formail -IX-Spam-Status:</code></pre>
<p>I don&#8217;t want to remove every previous SpamAssassin header, since our mail setup already sets one that I can trust; but since we&#8217;re not using the same version as the one given in the <code>X-Spam-Checker-Version</code>, I&#8217;m on the safe side. And after a quick check, these spams amounted to around half of the spams that went through my filters in June, so I should get even better results with my anti-spam setup.</p>
<p>Well, until spammers start upgrading their fake headers, I guess.</p>]]></content:encoded>
			<wfw:commentRss>http://people.w3.org/~dom/archives/2004/07/fake-spamassassin-headers/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Spam statistics</title>
		<link>http://people.w3.org/~dom/archives/2004/07/spam-statistics/</link>
		<comments>http://people.w3.org/~dom/archives/2004/07/spam-statistics/#comments</comments>
		<pubDate>Tue, 06 Jul 2004 23:52:30 +0000</pubDate>
		<dc:creator>Dom</dc:creator>
				<category><![CDATA[spam]]></category>

		<guid isPermaLink="false">http://people.w3.org/~dom/archives/2004/07/spam-statistics/</guid>
		<description><![CDATA[I&#8217;ve run very crude statistics on the amount of spam I&#8217;m getting and filtering in the past 6 months:

I&#8217;m getting between 500 and 600 messages a day
among those, around 400 are spam
among them, the vast majority (~90%) is simply trashed, relying on SpamAssassin &#8211; I basically direct all messages with a SA score greater than [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve run very crude statistics on the amount of spam I&#8217;m getting and filtering in the past 6 months:</p>
<ul>
<li>I&#8217;m getting between 500 and 600 messages a day</li>
<li>among those, around 400 are spam</li>
<li>among them, the vast majority (~90%) is simply trashed, relying on <a href="http://www.spamassassin.org/">SpamAssassin</a> &#8211; I basically direct all messages with a SA score greater than 12 to <code>/dev/null</code></li>
<li>on the remaining 10%, 90 to 95% are put in a distinct mailbox (cleverly labeled <code>spam</code>)</li>
<li>&#8230; which leaves me with about 3 spams in my inbox per day, which is quite manageable</li>
<li>the false positives for these 6 months amount to less than 50 messages, most of which wouldn&#8217;t have been a big loss if I hadn&#8217;t spotted them in my spam mailbox</li>
</ul>
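<p>These figures can be cross-checked with a bit of arithmetic; the short Python sketch below only uses the rounded numbers from the list above (400 spams a day, 90% trashed, 90 to 95% of the rest filed in the spam mailbox), so its result is an approximation:</p>
<pre><code>spam_per_day = 400
# ~90% goes straight to /dev/null, so ~10% remains
remaining = spam_per_day * 10 // 100
# 90 to 95% of the remainder lands in the spam mailbox,
# which leaves 5 to 10% of it in the inbox
inbox_low = remaining * 5 // 100
inbox_high = remaining * 10 // 100
print(remaining, inbox_low, inbox_high)
</code></pre>
<p>That gives 40 messages a day that are not trashed, of which 2 to 4 reach the inbox, consistent with the figure of about 3 a day.</p>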
<p>I&#8217;ve gathered these statistics from my <a href="http://www.procmail.org/">procmail</a> log, using a simple <code>grep</code>, à la <kbd>grep -A 2 "Mon Jul  5" .procmail/log |grep "/dev/null"|wc -l</kbd> (this gives the number of messages directed to <code>/dev/null</code> on Monday, July 5th).</p>
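<p>For readers less at ease with <code>grep</code>, roughly the same count can be sketched in Python. The function name <code>count_deliveries</code> and the sample log are made up for this example, and the sketch assumes the usual procmail log layout (a <code>From</code> line carrying the date, followed by <code>Subject:</code> and <code>Folder:</code> lines); it approximately mirrors the pipeline rather than reproducing it exactly:</p>
<pre><code>def count_deliveries(log_lines, day, folder):
    """Count messages received on `day` whose Folder line names `folder`."""
    count = 0
    remaining = 0  # lines still to inspect after a line matching the day
    for line in log_lines:
        if day in line:
            remaining = 2  # same window as grep -A 2
        elif remaining:
            remaining -= 1
            if folder in line:
                count += 1
    return count


# Invented sample entries in the assumed log layout
SAMPLE_LOG = """\
From spammer@example.net  Mon Jul  5 09:12:01 2004
 Subject: cheap pills
  Folder: /dev/null  2048
From friend@example.org  Mon Jul  5 10:30:45 2004
 Subject: lunch?
  Folder: inbox  1024
From other@example.net  Tue Jul  6 08:00:00 2004
 Subject: more pills
  Folder: /dev/null  4096
"""

print(count_deliveries(SAMPLE_LOG.splitlines(), "Mon Jul  5", "/dev/null"))
</code></pre>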
<p>With a simple loop, I can get data across the past months in a comma-separated format: <kbd>for i in `seq 190 -1 1` ; do day=`LC_ALL="en" date --d "$i days ago"|cut -b -11`; echo $day , `grep -A2 "$day" .procmail/log|grep "/dev/null"|wc -l` , `grep -A2 "$day" .procmail/log|grep "spam"|wc -l`;  done > spam-evolution.csv</kbd>; once loaded in <a href="http://www.gnome.org/projects/gnumeric/">Gnumeric</a> (the Gnome equivalent of Excel), I can get graphs of how the distribution of my mail between <code>/dev/null</code> and my spam mailbox evolves:</p>
<p>The one below (time on horizontal axis, number of messages on vertical) shows that the number of spams has steadily grown in the past months:<br />
<img src="http://people.w3.org/~dom/wp-content/spam-evolution-200401-07.png" alt="Distribution of my spam between trash and spam mailbox, received between start of January and end of June 2004" /><br />
It would be interesting to see whether the peaks here and there match some of the spam storms W3C has encountered this year.</p>
<p>The following graph (percentage of spams directly trashed vs those in my spam mailbox over time) seems to indicate either a slow gain in efficiency in SpamAssassin at trashing spams, or a rise in the rate of obvious spams<br /><img src="http://people.w3.org/~dom/wp-content/spam-evolution-pourcentage-200401-07.png" alt="Evolution of the ratio between my trashed spam and the one that wasn't directly trashed" /><br />I&#8217;m tempted to think it&#8217;s the former, given that spams tend to get &#8220;smarter&#8221; at hiding themselves, and since there is a rational explanation for SpamAssassin getting better &#8211; I&#8217;m using its <a href="http://spamassassin.sourceforge.net/doc/sa-learn.html">Bayesian training functions</a>. Also worth noting is that the variation of this percentage seems to diminish over time, although I have no idea how this should be interpreted!</p>]]></content:encoded>
			<wfw:commentRss>http://people.w3.org/~dom/archives/2004/07/spam-statistics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
