Don’t call me DOM

26 April 2005

Named versus Numeric Entities


I was kindly notified that my RSS feed had been ill-formed for the past few days, because of an &mdash; entity in one of my previous posts. While I have fixed the issue manually, I was asked in return why this entity made my feed invalid. I’m taking this question as an opportunity to start a new category on tutorials about markup languages; it may end up being one article only, but I’m interested in getting more questions for this category if people find it useful (let me know through comments or mail).

So, why would this &mdash; entity make my RSS feed ill-formed and not my XHTML post?

XML (and SGML before it) allows authors to include any character from the Unicode character set (also known as ISO/IEC 10646) either directly, as a character encoded according to the encoding set in the XML declaration, or as a so-called character reference of the form &#characterCode;, often called a numeric entity.

Why two ways of inserting a character? There are two main situations where entering the character directly is not possible or practical:

  • the input device doesn’t allow entering the given character, or not in an easy way (try entering a Japanese character on a Western keyboard!)
  • the encoding declared for the given XML document only encodes a reduced character set that doesn’t include the targeted character; as a matter of fact, most encodings map a reduced character set, and the beauty of UTF-8 and the other UTF-* encodings is that they map the full Unicode character set

For instance, the character e with an acute accent can be directly entered as “é” if your input device allows it and if “é” is available in the declared encoding; otherwise, you can insert it as &#233;, since 233 is the code for “é” in Unicode.
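As a small illustration of the two ways of writing “é”, here is a sketch in Python (my choice of language for the example, not something the feed or blogging tool uses). It relies only on the standard library; the variable names are made up for the illustration.

```python
import xml.etree.ElementTree as ET

text = "caf\u00e9"  # "café", with é written as an escape so this file stays ASCII

# 233 is the Unicode code point of "é".
code_point = ord("\u00e9")

# Latin-1 contains "é", so the character can be encoded directly.
latin1_bytes = text.encode("latin-1")

# ASCII does not contain "é"; the xmlcharrefreplace error handler falls back
# to a numeric character reference instead of failing.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

# An XML parser resolves the reference back to the same character.
parsed = ET.fromstring("<word>caf&#233;</word>").text

print(code_point, latin1_bytes, ascii_bytes, parsed == text)
```

The same text thus survives in an encoding that cannot represent “é” directly, at the cost of a less readable source.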

In a syntax similar to character references, SGML and XML also define named entities, of the form &name;. These named entities can stand for (almost) any type of content, but their best-known usage is to serve as a replacement for character references. HTML defines three sets of named entities (one mapping the ISO Latin 1 character set, another mapping a set of symbols, and one mapping a set of “special” characters such as quotes, the euro sign, etc.). For instance, in HTML, you can use &mdash; to insert a middle dash (not easily available on any keyboard I know) instead of &#8212;; both are correct of course, but the former is much easier to remember than the latter for most people.

But the important thing to remember with regard to named entities is that they are bound to a DTD; more precisely, one can only use a named entity in a document that explicitly refers to a DTD that defines the given entity; otherwise, in the best case, the parser will simply not make sense of the said entity, and in the worst case (depending on the parser type), the parser will completely reject the full document.
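To see this binding at work, here is a minimal sketch using Python’s standard xml.etree.ElementTree parser, which is one of the parsers that reject the document outright. The helper name `parses` and the sample documents are invented for the example.

```python
import xml.etree.ElementTree as ET

def parses(xml_source):
    """Return True if the parser accepts xml_source, False if it rejects it."""
    try:
        ET.fromstring(xml_source)
        return True
    except ET.ParseError:
        return False

# No DTD: the named entity is undefined, so the whole document is rejected.
without_dtd = "<title>Named &mdash; Numeric</title>"

# The numeric character reference needs no declaration.
numeric = "<title>Named &#8212; Numeric</title>"

# An internal DTD subset that declares the entity makes the name usable.
with_dtd = (
    '<!DOCTYPE title [<!ENTITY mdash "&#8212;">]>'
    "<title>Named &mdash; Numeric</title>"
)

print(parses(without_dtd), parses(numeric), parses(with_dtd))
```

Only the first document fails; the other two yield the same text once parsed.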

In other words, one can only use &mdash; in an XML document that declares that this named entity is equivalent to “&#8212;”; any valid (X)HTML document has a doctype that references the named entity declarations, so this isn’t a problem. But for most other XML documents (e.g. my RSS feed), this entity does not represent anything and makes the document as a whole ill-formed. In my own case, my blogging tool should have replaced the named entity with either the actual character or a numeric entity when generating the RSS feed; I haven’t fully explored yet why it doesn’t, but that’s definitely a bug.
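The kind of replacement a feed generator could apply can be sketched as follows. This is not how my blogging tool works; it is a hypothetical helper, `named_to_numeric`, written in Python and relying on the HTML 4 entity table shipped in the standard library as `html.entities.name2codepoint`. It assumes the input is markup in which every `&name;` sequence is meant as an entity.

```python
import re
from html.entities import name2codepoint

# The five entities every XML parser knows without a DTD; leave them alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def named_to_numeric(markup):
    """Replace HTML named entities with numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep predefined and unknown names as-is
        return "&#%d;" % name2codepoint[name]
    return ENTITY_RE.sub(replace, markup)

print(named_to_numeric("Tom &amp; Jerry &mdash; caf&eacute;&nbsp;!"))
# prints: Tom &amp; Jerry &#8212; caf&#233;&#160;!
```

Unknown names are left untouched on purpose, so that a genuinely undefined entity still surfaces as an error in the feed rather than being silently dropped.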

Note that named entities are one of the reasons why DTDs are still used in XHTML or MathML; there is no other schema technology that defines a way to create named entities, and while there has been a lot of discussion on how to get rid of DTDs in XML, named entities are still one of the requirements that are hard to get around, due to their popularity for hand-authored documents.

(As an authoring shortcut, numeric and named entities are so useful that some have proposed to extend their usage for any kind of text, relying on a predefined set of useful names).

8 Responses to “Named versus Numeric Entities”

  1. gregR Says:

    Excellent ! Thanks for sharing with us your knowledge

  2. Jon Says:

    That’s not a “middle dash”, it’s an “em dash”, named after the letter M, as it should be the same width as an M in the same font. There’s also an “en dash”. See the link for more info.

    Also, as far as HTML, I believe browser support for named entities is not as good as for numbered entities, which irritates me.

  3. zed Says:

    Well, you’ve touched the subject of magic things… Everyone writes about XHTML, DOM and such. But no one could help me with this: let’s create a true XHTML document in PHP. Be careful – you cannot do this without a server-side script! Did you know? It doesn’t matter how the “content-type” meta is set in your HTML header. When a browser reads this file, it reads it according to the previously defined content-type. The content-type is really defined via the HTTP header sent by the server. So in PHP we can generate such a header and send it, like header('content-type: application/xhtml+xml; charset=utf-8'). Good. Now we can really send XHTML. It MUST be well-formed, or no good browser will display it. So… Let’s put an innocent-looking “&nbsp;” into our XHTML. And what happened? Why doesn’t it display? Not in Firefox. Not in Opera. The W3C validator says the document is well-formed. What’s more, nbsp and other entities ARE really defined in the DTD which is pointed to in the DOCTYPE. What’s more, you can copy and paste the DTD URL and check for yourself: the named entities ARE REALLY DEFINED THERE. The W3C validator sees them. Browsers don’t. The only way I’ve found is… to encode all named entities as numeric entities. Email me if I’m wrong, but I think it’s because most browsers DON’T really support XHTML+XML content. Their parsers are BUGGY and cannot make any use of the DTD URL in the DOCTYPE. Before yesterday, I thought there was no other way out except to quit trying to use XHTML (by serving it without defining the proper content-type header). The solution came accidentally. I imported a DOM node from one HTML file to another. It contained an nbsp entity. And the resulting file displayed well in Opera with the XML header, which, as I thought, was impossible. So I checked the source and found that the nbsp entity had been converted to a numeric one by the DOM-tree processing code.

  4. Resuna Says:

    To summarize the summary: XML is a problem.

  5. BobH Says:

    Just a note on your reference to “middle dash.” It’s really “em dash.” An “em space” or “em” comes from old-fashioned typesetting with lead type. An em is a square with each edge the size of the type — typically 10 points, or 10/72 of an inch for medium-sized text. The unit carries forward into CSS today. An em was used as a standard paragraph indent, and an em dash was supposed to be as wide as an em space. An en space is half the width of an em space, and an en dash is half the width of an em dash — still wider than a hyphen, which people often mistakenly call a dash. A thin space is still narrower — one third the width of an em space.

  6. Michael Says:

    I usually type in the Unicode character directly, finding it easier to read, but I prefer named entities in a few cases: one is where the character is not sufficiently distinct in its rendered form. For example, who can tell the difference between a non-breakable space and an ordinary space or even an en space just by looking at it?

    Perhaps someone could write a simple Perl or Javascript to translate between the named and numbered entities to make the XML source easier to read.

    XML is not as much of a problem as HTML. With XML, it is much easier to do semantic markup. That means that the user can do much more processing on the page, e.g., to extract information. HTML is mostly about presentation. We need to get beyond the limited notion that web pages are purely visual.

    Finally, an em dash is usually wider than the letter M. It was named after M, but that should not be taken literally. It is usually as wide as the type size, i.e., in 12 point type an em is 12 points wide, but the type designer can make it a little wider or a little narrower, if he finds that it will look better for that font.

  7. Blog to Email Subscriptions using CMS Made Simple and Feedburner | Website Design Blog Says:

    [...] Tell TinyMCE to use numeric encoding of entities. In CMSMS, go to Extensions -> TinyMCE WYSIWYG. Under the Advanced tab we need to select Numeric encoding in the “Encoding of Entities” dropdown box. We need to do this because named entities prevent RSS feeds from validating. For more info read Named versus Numeric Entities. [...]

  8. m » Blog Archive » Accentate, rss, php Says:

    [...] with RSS before I’m sure you understand the dilemma. For those of you that don’t know, RSS has a problem with validating if there are non-numeric HTML entities (the normal output you get from htmlentities()). More headaches, more XML [...]

Dominique Hazaël-Massieux (dom@w3.org) is part of the World Wide Web Consortium (W3C) Staff; his interests cover a number of Web technologies, as well as the usage of open source software in a distributed work environment.