Don’t call me DOM

6 July 2005

Beautifying URIs


Just because URIs are opaque doesn’t mean they should be ugly. Call me a Web purist, but I do have opinions on the aesthetics of URIs.

The ugliest part of a URI is usually the query component (i.e. the part after the question mark), where Web-based applications communicate a varying number of parameters of varying importance. The most widespread ugliness is probably found on those sites where every single page has a URI à la index.php?page=welcome.

My number one rule of thumb when I design the URI space for a Web application based on PHP (or any other scripting language) is to determine the relative importance of the parameters communicated through HTTP GET. Typically, some parameters are essential to the content of the page (generally the identifier of an object) while others only affect the way the information is presented (e.g. sorting the data in a different order). Although this characterization is sometimes subjective, once it is done, the rule is to use the query component of the visible URI only to communicate the minor parameters, and to put the other parameters in the path component, using its hierarchical nature where it fits.

For instance, WBS, a Web questionnaire system I developed for W3C, puts the important information in the path component: even though the content of a questionnaire happens to be rendered by a script called vote.php3 taking a wgid and a qaireno parameter, the end user only sees a URI à la http://www.w3.org/2002/09/wbs/wg/qaire/ when using the system. Since a given questionnaire (identified by its qaireno) is always bound to a given group (identified by its wgid), it made sense to use the hierarchical nature of the path component to reflect this.

But how does this work? Using the single best and ugliest feature in Apache, the URL rewriting module. For instance, the WBS example above is put into action using a .htaccess file:


# the directives needed to activate the rewrite module in the given directory
RewriteEngine On
RewriteBase /2002/09/wbs/
# and the actual rewrite rule: first matches a number, then a slash, then a string, then a slash
# and pass internally the number as wgid parameter, the string as qaireno parameter to the script vote.php3
RewriteRule ^([0-9]*)/([^/]*)/$ http://www.w3.org/2002/09/wbs/vote.php3?qaireno=$2&wgid=$1 [P,L]

Besides fighting ugliness, this helps tremendously when you want cool URIs that don’t change: if for some reason you need to change the name of a parameter, or of the script, etc., only the rewrite rule changes and the visible URIs stay the same.
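
To make the mapping performed by the rewrite rule more concrete, here is a minimal sketch in PHP that applies the same pattern to a path and builds the internal URI. The function name wbs_internal_uri() and the sample values 1234 and survey2005 are made up for illustration; this is not code from WBS, just a way to see which part of the path ends up in which parameter.

```php
<?php
// Hypothetical helper (not part of WBS): mimics what the RewriteRule does,
// turning the path seen by the user into the URI of the script.
function wbs_internal_uri($path) {
  // same pattern as in the RewriteRule: a number, a slash, a string, a slash
  if (!preg_match('#^([0-9]*)/([^/]*)/$#', $path, $m)) {
    return null; // the path does not look like a questionnaire URI
  }
  // the number becomes wgid, the string becomes qaireno
  return "vote.php3?qaireno=" . urlencode($m[2]) . "&wgid=" . urlencode($m[1]);
}

// 1234 and survey2005 are made-up sample values
echo wbs_internal_uri("1234/survey2005/"), "\n";
// prints vote.php3?qaireno=survey2005&wgid=1234
```

If the script or one of its parameters is renamed, only the right-hand side of this mapping changes; the path on the left-hand side, which is what users bookmark, stays as it is.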

While this works fairly well for applications whose parameters are organized hierarchically (and fortunately, many blogging tools now use RewriteRules to achieve a nice URI layout), there are some applications where there isn’t such a thing as hierarchical data, and using the query component is really the best you can do. A typical use case for this is searching: when searching on only one keyword, the query component generally doesn’t grow too badly, but as soon as you start dealing with complex searches and many parameters, the query string grows and grows, generating terrible URIs à la http://example.net/advanced-search?param1=foo&param2=bar&param3=blalala&param4=&param5=&param6=.

While there isn’t much you can do about it, one thing that I have found useful lately is to at least strip from the URIs all the parameters that aren’t actually used. In the example above, param4, param5, and param6 don’t actually pass any information (e.g. because nothing was input in the form for the corresponding controls) and could be removed from the URI without any loss. While one can deal with this on the client side (with XForms when possible, otherwise with JavaScript, using the disabled state of HTML controls), it is also possible to deal with it on the server side, using HTTP redirects.

The code below does just this for PHP scripts:


// Turn the PHP $_GET array into a query string
// while removing any empty parameter
// (this is separated from clean_get() as it can be re-used in other contexts
// e.g. creating links to next/previous results in a search query)
function from_GET_to_qs() {
  // Let's find out what parameters were passed in the request
  // and keep only those that are not empty.
  // We compare against "" rather than using empty(), which would also drop "0"
  $params = array();
  foreach ($_GET as $name => $value) {
    if ($value !== "" && $value !== array()) {
      $params[$name] = $value;
    }
  }
  // http_build_query() URL-encodes names and values, joins them with "&"
  // and expands parameters that are transmitted as arrays into name[key]=value
  // See "array from form element" "feature"
  return http_build_query($params, "", "&");
}

// Redirects a GET with plenty of empty fields to a nicer URI
function clean_get() {
  // We only do a redirect if there is indeed anything empty in the query string
  // (array_search() returns a key, which can itself evaluate to false,
  // so we use a strict in_array() instead)
  if (in_array("", $_GET, true)) {
    $qs = from_GET_to_qs();
    // no dangling "?" if every parameter was empty
    $target = $_SERVER['PHP_SELF'].($qs !== "" ? "?".$qs : "");
    // the third argument sets the 301 status code
    header("Location: ".$target, true, 301);
    // stop here: the redirected request will do the actual work
    exit;
  }
}

With a call to clean_get() placed at the start of a PHP script (before any output is sent), the user is always redirected to the shortest possible URI for the given parameters.
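
As noted in the comments of from_GET_to_qs(), the same logic is handy for building links to next/previous results. The sketch below is one possible way to do that, not the code used on any real site: params_to_qs() is a hypothetical variant that takes the parameters as an argument instead of reading $_GET, and the parameter names q, sort, lang and page are made up for the example.

```php
<?php
// Hypothetical variant of from_GET_to_qs(): takes the parameters as an
// argument and lets the caller override some of them (e.g. the page number).
function params_to_qs(array $params, array $overrides = array()) {
  $kept = array();
  // array_replace() keeps the keys as they are; values in $overrides win
  foreach (array_replace($params, $overrides) as $name => $value) {
    // drop empty parameters, but keep "0" and any other real value
    if ($value !== "" && $value !== null) {
      $kept[$name] = $value;
    }
  }
  // URL-encodes names and values, and joins the pairs with "&"
  return http_build_query($kept, "", "&");
}

// Parameters as they might arrive from a search form (made-up names)
$params = array("q" => "cool uris", "sort" => "", "lang" => "", "page" => "2");

// Link to the next page of results, without the unused parameters
echo "search?" . params_to_qs($params, array("page" => 3)), "\n";
// prints search?q=cool+uris&page=3
```

Taking the parameters as an argument also makes the function easy to try out outside of a Web request.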

7 Responses to “Beautifying URIs”

  1. gregR Says:

    Very useful, Dominique. I’m sure that not every user of blogging software understands rewriting rules and how they work.

    Now I have a question: I implemented a modified (simplified, to be honest!) version of a web service created by J. Gregorio, Sparklines: http://bitworking.org/projects/sparklines/
    Sorry, no URI to provide because of a restricted access.

    Written in PHP, it takes some GET variables to produce an image with GD; my URIs are currently
    http://www.example.com/generator.php?type=line&data=-10,15,2.5&height=15&step=4
    for line graphics and
    http://www.example.com/generator.php?type=bar&data=-10,15,2.5&height=15&width_b=3&treshold=0&abovecolor=red&belowcolor=blue
    for bar graphics.

    For every “type” of graphic, there are common parameters, “data” and “height”, and type-specific parameters:
    “step” for the line “type”
    “treshold”, “abovecolor”, “width_b” and “belowcolor” for the bar “type”.

    It seems kind of hierarchical, but to use mod_rewrite, do I need to have several rewriting rules in the same .htaccess?

  2. dom Says:

    Hi Greg, I’m not sure the example you give is really using hierarchical data, with the possible exception of “line” vs “bar” graphics. For this one, I would have my URIs generated à la http://www.example.com/graphics/bar?data=-10,16,2.5&height….
    To that end, you could use the following rewrite rule:

    RewriteEngine On
    RewriteBase /graphics/
    RewriteRule ^bar$ http://www.example.com/generator.php?type=bar [P,QSA]

    P here stands for Proxy (it will work only if Apache has mod_proxy available, and it needs a full URI as its target) and thus hides the sub-request. QSA stands for Query String Append, and means that the query string received on the original resource is appended when the proxied request is made.

  3. gregR Says:

    Thanks for your rapid response, Dom. I’ll try this tomorrow and let you know the result.

  4. Mark Nottingham Says:

    Good stuff!

    The only caveat I have is that some people will assume that because they have “pretty” URIs, they somehow have a more Web-friendly app. While the URIs look better, they don’t get other benefits like caching, because the resource is still script-based and doesn’t supply a validator for caches to reuse.

    That isn’t to say that this shouldn’t be done, of course; only that more is needed to get parity with filesystem-based resources (which usually do supply Last-Modified, ETag, etc.).

    shameless plug
    One way to do this with PHP can be found at: http://www.mnot.net/cgi_buffer/
    /shameless plug

    Cheers,

  5. 虚拟主机 Says:

    Thanks for the information. This is very useful

  6. Joshua Ferraro Says:

    Of course, another way to remove empty parameters is with mod_rewrite … here’s a recipe I’ve used with some success:

    RewriteCond %{QUERY_STRING} (.*?)(?:[A-Za-z0-9_-]+)=&(.*)
    RewriteRule (.+) $1?%1%2 [N,R,NE]

  7. paolo Says:

    Very useful. Thank you. I am trying Joshua’s suggestion for the moment.
    I needed to change the RewriteRule to:
    RewriteRule (.+)/ $1?%1%2 [N,R,NE] (with a slash added)

    But I wonder if it would be better to do it in PHP as the article suggests? And, if so, what is the best way of connecting the function above to the submit button on a form?

Dominique Hazaël-Massieux (dom@w3.org) is part of the World Wide Web Consortium (W3C) Staff; his interests cover a number of Web technologies, as well as the usage of open source software in a distributed work environment.