I was kindly notified that my RSS feed was ill-formed for the past few days, because of an
&mdash; entity in one of my previous posts. While I have manually fixed the issue, I was asked in return why this entity made my feed invalid. I’m taking this question as an opportunity to start a new category of tutorials about markup languages; it may end up containing only one article, but I’m interested in getting more questions for this category if people find it useful (let me know through comments or mail).
So, why would this
&mdash; entity make my RSS feed ill-formed and not my XHTML post?
XML (and SGML before it) allows authors to include any character from the Unicode character set (also known as ISO/IEC 10646), either directly, as a character in the encoding named in the XML declaration, or as a so-called character reference of the form
&#characterCode;, often called numeric entities.
Why two ways of inserting a character? There are two main situations where entering the character directly is not possible or practical:
- the input device doesn’t allow entering the given character, or not in an easy way (try entering a Japanese character on a Western keyboard!)
- the encoding declared for the given XML document only covers a reduced character set that doesn’t include the targeted character; as a matter of fact, most encodings map a reduced character set, and the beauty of UTF-8 and the other UTF-* encodings is that they map the full Unicode character set
For instance, the character e with an acute accent can be directly entered as “é” if your input device allows it and if “é” is available in the declared encoding; otherwise, you can insert it as
&#233;, since 233 is the code for “é” in Unicode.
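As a rough illustration of both situations, here is a minimal Python sketch (Python and its standard library are my choice for the example, not something the feed or blogging tool relies on). It checks that 233 is the code for “é”, shows that a reduced encoding such as ASCII cannot hold the character directly, and shows that the character reference survives and is resolved by an XML parser.

```python
import xml.etree.ElementTree as ET

# 233 is the decimal Unicode code point of "é" (U+00E9).
code_point = ord("\u00e9")

# ASCII is a reduced character set: "é" cannot be encoded directly.
try:
    "caf\u00e9".encode("ascii")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False

# The "xmlcharrefreplace" error handler falls back to a character reference.
escaped = "caf\u00e9".encode("ascii", errors="xmlcharrefreplace")

# An XML parser resolves the reference back to the actual character.
element = ET.fromstring("<p>caf&#233;</p>")
resolved = element.text
```

Here `escaped` holds the ASCII-only bytes `caf&#233;`, and `resolved` is the string “café” again.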
In a syntax similar to character references, SGML and XML also define named entities, of the form
&name;. These named entities can stand for (almost) any type of content, but their best-known usage is to serve as replacements for character references. HTML defines three sets of named entities (one mapping the ISO Latin 1 character set, another mapping a set of symbols, and one mapping a set of “special” characters such as quotes, the euro sign, etc). For instance, in HTML, you can use
&mdash; to insert an em dash (not easily available on any keyboard I know) instead of
&#8212;; both are correct of course, but the former is much easier to remember than the latter for most people.
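To check the equivalence between the two forms, one can look at the HTML entity table that ships with Python's standard library; this is only a sketch using `html.entities` as a convenient stand-in for the HTML entity sets, not the DTDs themselves.

```python
import html
from html.entities import name2codepoint

# The HTML entity table maps the name "mdash" to a Unicode code point.
mdash_code = name2codepoint["mdash"]

# Both the named and the numeric form decode to the same character.
from_name = html.unescape("&mdash;")
from_number = html.unescape("&#8212;")
same_character = from_name == from_number == "\u2014"
```

`mdash_code` is 8212, which is exactly the number used in the numeric form.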
But the important thing to remember with regard to named entities is that they are bound to a DTD; more precisely, one can only use a named entity in a document that explicitly refers to a DTD that defines the given entity; otherwise, in the best case, the parser will simply not make sense of the said entity, and in the worst case (depending on the parser type), the parser will completely reject the full document.
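The dependency on a declaration can be seen with a small experiment. The sketch below assumes Python's `xml.etree.ElementTree` (a non-validating, expat-based parser) and a made-up `<description>` element; the internal DTD subset in the second document is a hypothetical minimal declaration, not the real XHTML DTD.

```python
import xml.etree.ElementTree as ET

# No DTD at all: the parser has no idea what &mdash; stands for.
undeclared = "<description>before &mdash; after</description>"
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# Same content, but an internal DTD subset declares the entity first.
declared = (
    '<!DOCTYPE description [<!ENTITY mdash "&#8212;">]>'
    "<description>before &mdash; after</description>"
)
declared_text = ET.fromstring(declared).text
```

With this parser, the first document is rejected outright, while the second yields the text with the actual dash character in place of the entity.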
In other words, one can only use
&mdash; in an XML document that declares that this named entity is equivalent to “—”; any valid (X)HTML document has a doctype that references the named entity declarations, so this isn’t a problem. But for most other XML documents (e.g. my RSS feed), this entity does not represent anything and makes the document as a whole ill-formed. In my own case, my blogging tool should have replaced the named entity with either the actual character or a numeric entity when generating the RSS feed; I haven’t yet fully explored why it doesn’t, but that’s definitely a bug.
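For what it's worth, the replacement a feed generator could perform is short. The following is a minimal sketch, not what my blogging tool actually does: the function name `to_numeric_references` and the sample `<description>` string are invented for the example, and the entity names come from Python's `html.entities` table. The five entities predefined by XML itself are left alone, since they are legal in any XML document.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# These five names are predefined by XML and need no DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

def to_numeric_references(text):
    """Replace HTML named entities with numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave untouched
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)

item = "<description>before &mdash; after &amp; more</description>"
fixed = to_numeric_references(item)

# The fixed markup no longer depends on any DTD.
parsed_text = ET.fromstring(fixed).text
```

After the replacement, `fixed` contains `&#8212;` where `&mdash;` used to be, and the item parses without any doctype.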
Note that named entities are one of the reasons why DTDs are still used in XHTML or MathML; there is no other schema technology that defines a way to create named entities, and while there has been a lot of discussion on how to get rid of DTDs in XML, named entities are still one of the requirements that are hard to get around, due to their popularity for hand-authored documents.
(As an authoring shortcut, numeric and named entities are so useful that some have proposed extending their usage to any kind of text, relying on a predefined set of useful names.)