The spammers have struck again: we received reports that one of our extremely useful public services was being used to work around spammers' URL-matching countermeasures. In other words, a spammer who would have been identified (in email messages or blog comments) as using
http://example.net/ as a URI in his spam could work around the filter by linking to
http://our-useful-service.example.org?uri=http://example.net/ instead; and since the said service preserves the content more or less as is, this indeed amounted to a link to the incriminated content.
The reporter of this abuse had the good idea to mention an existing technical solution to this type of problem: SURBL, a registry of domain names that have been reported as used by spammers. Although I'm not a big fan of this type of registry (it seems like the lowest form of trust network one can imagine, and too easily abused), faced with the alternative of shutting down the service entirely or merely reducing the possibility of abusing it, I took the second option.
As the said service (which I'm not mentioning explicitly, in case that would draw more attention than it needs at this point) is written in Python, I looked for a Python implementation of the SURBL lookup mechanism; unfortunately, I wasn't able to find one, so I had to write my own implementation, which seems to work well (unit tests helped) and now protects the service in question.
SURBL uses DNS lookups as the way to query its registry: you ask whether
spam.example.multi.surbl.org exists; if it does, then
spam.example is on the blacklist; otherwise it isn't. In terms of implementation, the only (rather small) difficulty is identifying the relevant part of the URI you want to check, namely the registered domain name used in the authority component of the URI. This means removing any port and user information from the authority component, but also any sub-domains of the registered domain; that would be entirely trivial if one didn't have to take into account delegated second-level domain names (e.g. co.uk, under which names are registered one level deeper than usual).
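To make the mechanism concrete, here is a minimal sketch of such a check (not the actual code behind the service): extract the registered domain from the URI, then test whether that name exists under the SURBL zone. The tiny hard-coded set of two-level TLDs is an assumption for illustration; a real client should use the full list that SURBL publishes.

```python
import socket
from urllib.parse import urlparse

# Illustrative subset only; SURBL publishes a much fuller two-level-TLD list.
TWO_LEVEL_TLDS = {"co.uk", "com.au", "co.jp"}

def registered_domain(uri):
    """Return the registered domain of a URI's authority component,
    dropping user info, port, and any sub-domains."""
    host = urlparse(uri).hostname or ""
    labels = host.split(".")
    # Keep one extra label when the name sits under a delegated
    # second-level domain such as co.uk.
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_LEVEL_TLDS:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])

def is_blacklisted(uri, zone="multi.surbl.org"):
    """A domain is listed iff <domain>.<zone> resolves in DNS."""
    name = "%s.%s" % (registered_domain(uri), zone)
    try:
        socket.gethostbyname(name)
        return True
    except socket.gaierror:
        return False
```

For example, `registered_domain("http://user:pw@spam.example.com:8080/x")` yields `example.com`, which is then looked up as `example.com.multi.surbl.org`.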
SURBL’s architecture is a rather smart way to reuse the caching/query infrastructure already deployed for DNS (e.g. I just had to import DNS Python to query their blacklist). I suspect that the software infrastructure behind DNS implements caching more widely than most HTTP implementations do, and thus induces a smaller load on their servers, although I haven't actually checked. Of course, the DNS system is particularly well suited to this, given that the queried items follow (by definition) the naming rules of DNS. I also guess that the system's primary usage (mail filtering) made DNS a prerequisite, whereas access to an HTTP client may be less of a given. But I wonder whether there would be any compelling reasons to make this available as an HTTP service as well?