The Web is a formidable tool to host documentation; nothing new about that.
But documentation, be it on the Web or not, tends to rot when not maintained. Nothing new about that either.
While documentation maintenance is probably better addressed at the social-engineering level, there are tools that can help manage it. For instance, a few weeks ago, the W3C Systems Team went through the process of cleaning up our internal documentation on processes, tools, services, configurations, etc., which sits on our Team-only Web site but is too rarely kept up to date with the latest developments.
To help fill the gap, I wrote a small XSLT style sheet that annotates links with information on how recently the linked pages were updated. The goal was to run it on our main documentation pages and find, through a color code, the pages that hadn’t been updated recently.
The idea behind this tool, as illustrated when applied to the QA Activity home page, is to show the date of last modification of every page linked from a given page. To make the most outdated pages easier to spot, it also uses a color code, from pale yellow (most recent) to red (oldest), to denote how recently each linked page was updated.
Since it reveals several interesting aspects of developing with HTTP and XSLT, I’m going to give a bit more detail on how it actually works.
Getting the information from HTTP
The link annotations rely on the
Last-Modified HTTP header. This header, when set, indicates
the date and time at which the origin server believes the [page] was last modified.
In other words, the annotations only give interesting results on servers that are properly configured to send this header. For a statically served site on Apache (as most of the W3C Web site is), the server sends it automatically, based on the file’s last modification time on disk; it’s a bit trickier to get right for dynamically generated pages, but still very much doable, and directly worth it in terms of getting better caching behavior from browsers and proxies. In any case, Mark Nottingham’s caching tutorial is one of the best references on this topic.
Using XSLT to get and show the results
So, now that we know where the information is, how are we going to access it? Getting at HTTP headers is usually pretty easy in most HTTP-aware programming languages, but not by default in XSLT. (Why use XSLT then? Because it is so damn easy to use for parsing XML!)
To circumvent this limitation, I’m using what is, in my opinion, the most powerful way to extend XSLT: the
document() function combined with HTTP GET.
Indeed, XSLT allows importing and processing content from documents other than the one being processed, through the
document() function, which takes the URI of these other documents as a parameter. As this URI can be created at runtime (i.e., it can be the result of another expression), you can actually fetch the content of resources based on the main document being processed.
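As a minimal sketch of this idea (the element and attribute names here are invented for illustration, not taken from the actual tool), a style sheet can compute at runtime the URI it passes to document():

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- hypothetical example: for each <link> element in the input,
       load the document its @uri attribute points to and copy its title -->
  <xsl:template match="link">
    <!-- the argument to document() is an expression evaluated at runtime -->
    <xsl:copy-of select="document(@uri)//title"/>
  </xsl:template>
</xsl:stylesheet>
```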
In this case, we’re going to use this flexibility to get the HTTP headers from a completely separate tool: since that tool takes its
url parameter through HTTP GET and outputs its results in XHTML, one can parse from XSLT the headers of any HTTP URI by concatenating the said URI to
http://cgi.w3.org/cgi-bin/headers?url=.
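Concretely, sketching from the description above (the exact markup returned by the headers service may differ), the call amounts to something like:

```xml
<!-- fetch the XHTML output of the headers service for the URI held
     in $uri, by concatenating it to the service's base address -->
<xsl:variable name="headers"
              select="document(concat('http://cgi.w3.org/cgi-bin/headers?url=', $uri))"/>
<!-- the returned XHTML is then a regular node-set that the rest of
     the style sheet can query with plain XPath expressions -->
```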
Since this capacity to access HTTP headers has been useful to me more than once, I created, a while back, a small XSLT interface defining a set of named templates that I can import into new style sheets with
<xsl:import>. Over the years, I have gathered a few of these interfaces, which considerably reduce the development time of small or complex tools.
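A hedged sketch of what such an interface might look like (the file name, template name, and the dt/dd structure assumed for the service’s XHTML output are all illustrative, not the actual ones):

```xml
<!-- http-interface.xsl: hypothetical reusable interface -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- named template returning the value of a given HTTP header for a URI -->
  <xsl:template name="http-header">
    <xsl:param name="uri"/>
    <xsl:param name="header"/>
    <xsl:value-of select="document(concat('http://cgi.w3.org/cgi-bin/headers?url=', $uri))//*[local-name()='dt' and .=$header]/following-sibling::*[local-name()='dd'][1]"/>
  </xsl:template>
</xsl:stylesheet>
```

A new style sheet would then pull it in with <xsl:import href="http-interface.xsl"/> and invoke the template with <xsl:call-template>.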
In practice, the code of the main XSLT does the following processing:
- by default, we keep everything as is through the identity transformation:
<!-- default: Identity Transformation -->
<xsl:template match="*|@*|comment()|text()">
  <xsl:copy>
    <xsl:apply-templates select="*|@*|comment()|text()"/>
  </xsl:copy>
</xsl:template>
(there are simpler ways to write the identity transformation, but they would get caught by one of the bugs of the aforementioned XSLT servlet)
- since we’re interested in visible links to HTTP resources, we’re going to process each of these through:
<xsl:template match="html:a[@href and (starts-with(@href,'http:') or not(contains(@href,':')))]">
(i.e., this applies what follows to all the a elements that have an href attribute which either starts with http: or doesn’t contain a URI scheme, and is thus a relative HTTP URI)
- then, we transform the URI of the link into an absolute URI (using another imported XSLT)
- we check its HTTP status code to detect broken links, and for valid links, we extract the Last-Modified header
- the rest of the code deals with displaying only the interesting part of this header (namely, the date, since the day of the week and the time are less likely to be useful) and associating it with an HTML class, so as to facilitate the color coding through CSS
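As an illustration of that last step (the class names and the test on the year are made up here, not taken from the actual code), the date can be carved out of the header value and wrapped in a classed element:

```xml
<!-- $last-modified holds a value such as 'Tue, 15 Nov 1994 12:45:26 GMT';
     keep only the '15 Nov 1994' part and tag it with a CSS class -->
<xsl:template name="show-date">
  <xsl:param name="last-modified"/>
  <span>
    <!-- hypothetical class names, mapped to the yellow-to-red scale in CSS -->
    <xsl:attribute name="class">
      <xsl:choose>
        <xsl:when test="contains($last-modified, '2005')">recent</xsl:when>
        <xsl:otherwise>old</xsl:otherwise>
      </xsl:choose>
    </xsl:attribute>
    <!-- skip the day of week ('Tue, ') and the trailing time -->
    <xsl:value-of select="substring($last-modified, 6, 11)"/>
  </span>
</xsl:template>
```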
Et voilà for the processing!
The last bit is to provide an HTML interface to it; to that end, I’m using a CSS trick on the XSLT style sheet itself (a trick that didn’t work in Internet Explorer, last time I heard). The XHTML interface is embedded directly inside the XSLT, as a child of the root element, where XSLT processors are not required to complain about foreign elements. Then, with a CSS style sheet, I make sure the content of the templates is not displayed, so that by default browsers will only display the embedded XHTML.
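The skeleton of that trick looks roughly like this (the file names and titles are placeholders; the CSS rule in the comment is abridged):

```xml
<?xml-stylesheet type="text/css" href="style.css"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:html="http://www.w3.org/1999/xhtml">

  <!-- XHTML documentation embedded as a top-level foreign element;
       XSLT processors ignore non-XSLT top-level elements -->
  <html:html>
    <html:head><html:title>Last-modified annotator</html:title></html:head>
    <html:body>
      <html:p>Documentation and usage form for the tool go here.</html:p>
    </html:body>
  </html:html>

  <!-- the templates follow as usual; style.css hides everything in the
       XSLT namespace so browsers render only the XHTML above, e.g.:
       @namespace xsl url(http://www.w3.org/1999/XSL/Transform);
       xsl|template { display: none } -->
  <xsl:template match="/">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
```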
This wouldn’t prevent making a separate HTML interface, as I have come to do for other tools; but I have always liked the idea (stolen from DanC, as far as I can remember) of embedding HTML in my XSLT, if only as a way to document what the style sheets are for and how to use them.