Don’t call me DOM

21 April 2005

Links annotator

The Web is a formidable tool for hosting documentation; nothing new about that.

But documentation, be it on the Web or not, tends to rot when not maintained. Nothing new about that either.

While documentation maintenance is probably better addressed at the social-engineering level, there are tools that can help manage it. For instance, a few weeks ago, the W3C Systems Team went through the process of cleaning up our internal documentation on processes, tools, services, configurations, etc., which sits on our Team-only Web site but is too rarely kept up to date with the latest developments.

To help fill the gap, I wrote a small XSLT style sheet that annotates links with information on how recently the linked pages were updated. The goal was to run it on our main documentation pages and to spot, through a color code, the pages that hadn't been updated recently.

The idea behind this tool, as illustrated when applied to the QA Activity home page, is to show the date of last modification of all the pages linked from a given page. To make it easier to find the most outdated pages, it also uses a color code, from pale yellow (most recent) to red (oldest), to denote how recently each linked page was updated.

I think the tool is useful as is, although it could use some user interface polishing. It could also provide some ideas for new features in the W3C link checker, or the linkchecker Firefox extension.

Since it reveals several interesting aspects of developing with HTTP and XSLT, I'm going to give a bit more detail on how it actually works.

Getting the information from HTTP

The link annotations rely on the Last-Modified HTTP header. This header, when set, indicates the date and time at which the origin server believes the [page] was last modified.

In other words, the annotations only give interesting results on servers that are properly configured to send this header. For a statically served site on Apache (as most of the W3C Web site is), the header is sent automatically, based on the file's last modification time; it's a bit trickier to get right for dynamically generated pages, but still very much doable, and worth it anyway in terms of getting better caching behavior from browsers and proxies. In any case, Mark Nottingham's caching tutorial is one of the best references on this topic.
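
For instance, a HEAD request against a properly configured server would yield something along these lines (the dates are, of course, only illustrative):

    HEAD /QA/ HTTP/1.1
    Host: www.w3.org

    HTTP/1.1 200 OK
    Date: Thu, 21 Apr 2005 14:12:38 GMT
    Last-Modified: Tue, 12 Apr 2005 09:30:00 GMT
    Content-Type: text/html; charset=utf-8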

Using XSLT to get and show the results

So, now that we know where the information is, how are we going to access it? Getting at HTTP headers is usually pretty easy in most HTTP-aware programming languages, but this is not the case by default with XSLT. (Why use XSLT then? Because it's so damn easy to use for parsing XML!)

To circumvent this limitation, I'm using what is in my opinion the most powerful way to extend XSLT: the document() function combined with HTTP GET.

Indeed, XSLT makes it possible to import and process the content of documents other than the one being transformed, through the document() function, which takes the URI of these other documents as a parameter. Since this URI can be built at runtime (i.e., it can be the result of another expression), you can actually retrieve the content of resources based on the main document being processed.

In this case, we're going to use this flexibility to get the HTTP headers from a completely separate tool: since this tool takes its url parameter through HTTP GET and outputs its results in XHTML, one can parse from XSLT the headers of any HTTP URI by appending the said URI to http://cgi.w3.org/cgi-bin/headers?url=, namely with document(concat('http://cgi.w3.org/cgi-bin/headers?url=',$uri)).
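
Put together, a minimal, self-contained sketch of this pattern could look as follows (this stylesheet is an illustration, not the actual code of the tool; it merely dumps the text content of the page returned by the headers tool):

    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- URI of the resource whose headers we want to inspect -->
      <xsl:param name="uri" select="'http://www.w3.org/'"/>

      <xsl:template match="/">
        <!-- the URI passed to document() is computed at runtime -->
        <xsl:variable name="headers"
            select="document(concat('http://cgi.w3.org/cgi-bin/headers?url=',$uri))"/>
        <!-- dump the text content of the returned XHTML page -->
        <xsl:value-of select="$headers"/>
      </xsl:template>
    </xsl:stylesheet>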

Since this ability to access HTTP headers has been useful to me more than once, I created a while back a small XSLT interface that defines a set of named templates that I can import into new style sheets with <xsl:import>. Over the years, I have gathered a few of these interfaces, which considerably reduce the development time of small or complex tools.
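
As an illustration (the file and template names below are invented for the example, not taken from the actual interface), using such an interface boils down to:

    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- hypothetical interface defining HTTP-related named templates -->
      <xsl:import href="http-headers.xsl"/>

      <xsl:template match="/">
        <!-- call one of the imported named templates -->
        <xsl:call-template name="last-modified">
          <xsl:with-param name="uri" select="'http://www.w3.org/'"/>
        </xsl:call-template>
      </xsl:template>
    </xsl:stylesheet>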

In practice, the main XSLT style sheet does the following processing:

  1. by default, we keep everything as is through the identity transformation:

    <!-- default: Identity Transformation -->
    <xsl:template match="*|@*|comment()|text()">
      <xsl:copy>
        <xsl:apply-templates select="*|@*|comment()|text()"/>
      </xsl:copy>
    </xsl:template>

    (there are simpler ways to specify the identity transformation, but they would trip on one of the bugs of the XSLT servlet used to run this tool)

  2. since we’re interested in visible links to HTTP resources, we’re going to process each of them through:

    <xsl:template match="html:a[@href and (starts-with(@href,'http:') or not(contains(@href,':')))]">

    (i.e. this applies the processing that follows to all the a elements that have an href attribute which either starts with http: or doesn’t contain a URI scheme, and is thus a relative URI resolving to an HTTP one)

  3. then, we transform the URI of the link into an absolute URI (using another imported XSLT)
  4. we check its HTTP status code to detect broken links and, for valid links, we extract the Last-Modified header
  5. the rest of the code deals with displaying only the interesting part of this header (namely the date, since the day of the week and the time are less likely to be useful) and associating it with an HTML class, so as to facilitate the color coding through CSS, as sketched just after this list
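
For the date extraction and class assignment of that last step, a rough sketch (the template name and class value are assumptions made up for this example, not the actual code) could be:

    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    xmlns="http://www.w3.org/1999/xhtml">
      <!-- hypothetical template formatting the annotation for one link;
           $last-modified holds the raw value of the header -->
      <xsl:template name="annotate-link">
        <xsl:param name="last-modified"/>
        <!-- keep only the date:
             "Tue, 12 Apr 2005 09:30:00 GMT" -> "12 Apr 2005" -->
        <xsl:variable name="date"
            select="substring(substring-after($last-modified, ', '), 1, 11)"/>
        <!-- the class is the hook for the yellow-to-red color code in CSS;
             computing an age bucket is left out of this sketch -->
        <span class="last-modified">(<xsl:value-of select="$date"/>)</span>
      </xsl:template>
    </xsl:stylesheet>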

Et voilà for the processing!

The last bit is to provide an HTML interface to it; to that end, I’m using a CSS trick on the XSLT style sheet itself (a trick that didn’t work in Internet Explorer, last time I heard). The XHTML interface is embedded directly inside the XSLT, as a child of the root element, where XSLT processors are supposed to ignore foreign namespaced elements rather than complain about them. Then, with a CSS style sheet, I make sure the content of the templates is not displayed, so that by default browsers will only display the embedded XHTML.
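
In outline (the file name and XHTML content here are invented for the example), the arrangement looks like this:

    <?xml version="1.0"?>
    <!-- when a browser loads this file directly, the CSS below applies
         and hides the templates, leaving only the embedded XHTML visible -->
    <?xml-stylesheet type="text/css" href="annotate.css"?>
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    xmlns:html="http://www.w3.org/1999/xhtml">

      <!-- the XHTML interface, as a top-level element in a foreign
           namespace that XSLT processors ignore -->
      <html:div>
        <html:h1>Links annotator</html:h1>
        <html:p>Annotates the links of a page with the date their
          target was last modified.</html:p>
      </html:div>

      <!-- identity transformation and annotation templates follow here -->
      <xsl:template match="*|@*|comment()|text()">
        <xsl:copy>
          <xsl:apply-templates select="*|@*|comment()|text()"/>
        </xsl:copy>
      </xsl:template>
    </xsl:stylesheet>

The hypothetical annotate.css would then contain rules hiding the XSLT parts, for instance an @namespace declaration for the XSLT namespace followed by xsl|template { display: none }.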

This wouldn’t preclude doing a separate HTML interface, as I have come to do for other tools; but I have always liked the idea (stolen from DanC, as far as I can remember) of embedding HTML in my XSLT style sheets, if only as a way to document what they are for and how to use them.

One Response to “Links annotator”

  1. Zulfiqar Mansoor Says:

    Hi,
    Would it be possible to extract some information from HTTP Headers and format the HTML output accordingly using XSL/XSLT?

    I need guidance for the following:
    If a user (on an EVDO device) is browsing in an EVDO (fast) network, then the home page categories will be returned as a series of buttons on a grid. If a user is browsing on a slower network, then no grid and just standard list of options will be returned.

    Thus, I would like to extract network information from HTTP Headers and format my output accordingly. I am using Cocoon but completely new to it.

    Any help would be appreciated.

    Thanks.

    Zulfiqar

Dominique Hazaël-Massieux (dom@w3.org) is part of the World Wide Web Consortium (W3C) Staff; his interests cover a number of Web technologies, as well as the usage of open source software in a distributed work environment.