2006-03-16
Attributes and Elements
I’m sometimes asked when you should use attributes and when you should use elements when designing a new XML vocabulary. There are in fact no hard and fast answers for all situations, but there are some constraints and guidelines that may help.
The attributes of any particular XML element are an unordered set of name-value pairs. There can be no duplicated names (hence a set), but duplicated values are allowed in XML. The values cannot contain element markup, although they can contain entity references and character references. These constraints all follow directly from the XML specification, which also allows you, via a DTD, to place some simple constraints on the values of attributes: this one is a list of name tokens, this one has a space-separated list of name tokens that must each appear elsewhere in the document as the value of an attribute declared in the DTD to be of type ID, and so on. In practice the constraints on values that a DTD can impose are not very widely applicable, and people wanting to constrain content are more likely to be using W3C XML Schema to do so.
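Here is a small sketch of those rules in action, using Python's standard xml.etree.ElementTree parser. The boat element and its attribute names are made up for illustration, and the helper is_well_formed is mine, not part of any library.

```python
import xml.etree.ElementTree as ET

# Attribute order is not significant: both elements carry the same set.
a = ET.fromstring('<boat length="12" name="Dorothy"/>')
b = ET.fromstring('<boat name="Dorothy" length="12"/>')
same_attributes = a.attrib == b.attrib

# Duplicated values are fine, because the names differ.
c = ET.fromstring('<boat port="Halifax" builder="Halifax"/>')


def is_well_formed(text):
    """Return True if text parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# A duplicated attribute name is a well-formedness error.
duplicate_name_ok = is_well_formed('<boat name="a" name="b"/>')

# A literal "<", and so any element markup, is not allowed in a value.
markup_in_value_ok = is_well_formed('<boat name="<i>Dorothy</i>"/>')

# Character references and predefined entity references are allowed.
d = ET.fromstring('<boat name="Fish &amp; Chips &#233;"/>')
```

The two rejected documents fail in the parser itself, before any DTD or schema gets a look at them.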
Conventions of usage outside the XML specification itself have led to some other observations, some of which were learned by the SGML folks before XML was even started.
Most of the time when you have running text that might be presented to a human user, you will run across the need for element markup in that text. Even book titles sometimes have fragments of mathematics in them, and in Japan even names can have markup (called Ruby). As a result, the first principle for deciding when to use an attribute is this:
Attributes are for computers and elements are for people.
If you have come to XML from HTML, you might be thinking of the img element and its alt attribute, or of the title attribute on a div or anchor. These are simply bad design. There is no reason why a link title should not contain an image, and no reason why replacement text for an image can’t contain formatting (consider a poem, for example!).
If you have come to XML from RDF, you might be thinking that all properties are strings and that elements in RDF serializations denote a change of focus, the alternation between subject and relation. That’s because RDF serializations all suck, and is why markup must be escaped inside an RDFLiteral in RSS.
Wake up and smell the coffee, guys. Don’t put human-readable text in attributes. Let elements nest naturally.
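To make the point concrete, here is a rough sketch in Python (standard library only). The book, title and sub elements are an invented vocabulary, not taken from any real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: <book>, <title> and <sub> are invented here.
as_element = ET.fromstring(
    '<book><title>Notes on H<sub>2</sub>O</title></book>'
)
title = as_element.find('title')

# The subscript survives as structure that tools can see...
child_names = [child.tag for child in title]

# ...and the title can still be flattened to plain text when needed.
plain_text = ''.join(title.itertext())

# The attribute version can only hold a flat string: the subscript is
# either lost or smuggled in as escaped text that no XML tool understands.
as_attribute = ET.fromstring('<book title="Notes on H2O"/>')
flat_title = as_attribute.get('title')
```

With the element form you can always throw the markup away later; with the attribute form you can never get it back.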
If attributes are for machine-readable content, what sort of things are suitable? One approach is to say that attributes on an element are properties of that element. If you have a boat element, say, to describe a boat (Ballasted Open Aquatic Transportation device, for you military types), the attributes on that boat element would describe properties of the element, not properties of the boat. You might, for example, record the date on which the element’s contents were last modified.
Restricting attributes to be element properties quickly gets tedious, and in practice people intermix the model and the markup, and, say, give their boat a length attribute that is intended to be the length of the boat in nautical miles (or in metres, or whatever). This rarely causes problems in practice, as long as it is clear to everyone involved which is which.
People coming to XML from the relational database world want to make serializations of database tables or views, and in those cases they typically do have what XML people think of as unstructured data: that is, plain strings without interior markup. Furthermore the columns in a relational database table have unique names and are not ordered, so it is entirely reasonable to have a row element with an attribute for each field name. I more often see sub-elements used here, and I am not entirely sure why, but probably because the database people think of the content of the individual cells as being data and want the data in content rather than in attributes, which they think of as representing properties. But if each table row represents an object, the values in the cells in the row are the defining properties of that object, so you could easily argue either position.
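The two styles can be sketched side by side in Python with the standard library. The row below, its column names and its values are all invented for illustration; no real database is involved.

```python
import xml.etree.ElementTree as ET

# Hypothetical row from a hypothetical "boats" table.
row = {"id": "17", "name": "Dorothy", "length_m": "9.1"}

# Style 1: one attribute per column.
attr_row = ET.Element("row", row)

# Style 2: one sub-element per column.
elem_row = ET.Element("row")
for column, value in row.items():
    ET.SubElement(elem_row, column).text = value

attr_xml = ET.tostring(attr_row, encoding="unicode")
elem_xml = ET.tostring(elem_row, encoding="unicode")

# Either serialization can be read back into the same dictionary.
from_attrs = dict(ET.fromstring(attr_xml).attrib)
from_elems = {child.tag: child.text for child in ET.fromstring(elem_xml)}
```

Since both forms carry the same information for flat string data, the choice really is about taste and tooling rather than expressive power.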
It then comes down to tools. Attributes tend to be second-class citizens in XML APIs. You can’t duplicate them with XInclude; you need to read and parse them all before you process any of them, in case there are namespace pseudo-attributes intertwingled amongst them; in a DTD you can’t define a content model for which attributes must or must not appear in the way that you can for elements.
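A minimal sketch of the namespace point, using Python's standard parser: the prefix a is used on an attribute before the declaration that binds it, so nothing can be resolved until the whole start tag has been read. The namespace URI and attribute names are made up.

```python
import xml.etree.ElementTree as ET

# The prefix "a" appears before the xmlns:a declaration that binds it.
el = ET.fromstring(
    '<row a:kind="x" xmlns:a="http://example.org/a" size="3"/>'
)

# The declaration is consumed by the parser; it is not reported as an
# ordinary attribute, and the prefixed name comes back fully qualified.
attribute_names = set(el.attrib)
kind = el.get('{http://example.org/a}kind')
```

This is how ElementTree happens to present things; other APIs expose the declarations differently, which is rather the point about attributes being second-class.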
So in practice people use attributes to mark up properties that modify the way we think of an element, and elements for human-readable content, and for other things people choose based on the tools they are using.
And then there are people who pervert attributes into element names, so instead of saying <part-number> they say <person name="part-number"> or something like that. This practice has been ennobled recently with the name microformats, but really it’s a way for people to be able to extend their markup in an ad hoc way without having to think carefully about the consequences. If you are in control of the vocabulary you should of course not do this. If you are not in control of the vocabulary, you should consider using namespaces to introduce new elements instead of corrupting existing ones. Microformats are a last resort for people learning about markup in a stifling environment.
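As a rough sketch of the namespace alternative, here is a host document extended with a new element in its own namespace, read with Python's standard library. The item and description elements, the p prefix and the namespace URI are all hypothetical.

```python
import xml.etree.ElementTree as ET

EXT = "http://example.org/parts"  # hypothetical extension namespace

doc = ET.fromstring(
    '<item xmlns:p="http://example.org/parts">'
    '<description>Left-handed widget</description>'
    '<p:part-number>LH-042</p:part-number>'
    '</item>'
)

# Software that knows the extension asks for it by qualified name.
part = doc.find('{%s}part-number' % EXT)
part_number = part.text

# Software that only knows the host vocabulary can skip anything
# namespaced, and the existing elements keep their original meaning.
host_only = [child.tag for child in doc if not child.tag.startswith('{')]
```

Nothing in the host vocabulary had to be repurposed to carry the part number.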
OK, that’s enough ranting for today. If I left you a little confused about when to use attributes, let me know, but remember that outside the firm rules that come from the XML specification you’re on your own: the XML specification doesn’t say any more because implementation experience with SGML showed us that it’s not always clear when to use elements and when to use attributes, so we left it up to people using XML to decide what made most sense for them.