Don’t call me DOM

28 April 2005

WordPress and Named Entities

As I mentioned a few days ago, my blogging tool, WordPress 1.5, doesn’t deal with named entities as it should. Namely, when fed named entities, it outputs them as is in every context. But while named entities are fine in (X)HTML, they’re not in the various flavors of RSS/RDF: those XML formats don’t declare the HTML entities, so a parser cannot resolve them.
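
To make the failure concrete, here is a minimal sketch that feeds two tiny XML fragments to PHP's SimpleXML extension. The helper name parses_as_xml and the sample strings are mine, purely for illustration, and I'm assuming the SimpleXML and libxml extensions are available (they usually are by default). The fragment with the named entity should be rejected, while the one with the numeric reference should go through.

```php
<?php
// Hypothetical helper: true when $xml is accepted as well-formed XML by SimpleXML.
function parses_as_xml($xml) {
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $ok = simplexml_load_string($xml) !== false;
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

// A feed-like fragment with no DTD, so "eacute" is never declared.
$with_named   = '<title>Caf&eacute;</title>';
// The same text using a numeric character reference, which needs no declaration.
$with_numeric = '<title>Caf&#233;</title>';

var_dump(parses_as_xml($with_named));
var_dump(parses_as_xml($with_numeric));
```

This is only a check on well-formedness; how a given feed reader reacts to such a feed is up to that reader.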

It didn’t take much time to see that other people had the same problem as I had; fortunately, it didn’t take much time either to learn how to write a plugin for WordPress.

So, since I have two blogs running WordPress that I don’t expect to update to a more recent version anytime soon, I first coded up a plugin that converts HTML named entities into numeric entities. I’ve applied it to this very instance of WordPress, and it seems to work as expected.

And since a plugin that unbreaks something doesn’t feel quite right, I also made a proper patch that will hopefully fix the problem in future versions of WordPress if it is applied to the trunk; I have filed a bug report to that effect.

A few notes on the PHP code used to implement this change:

  • I was first hoping that a simple function call à la html_entity_decode($content,ENT_NOQUOTES,get_settings('blog_charset')) would do the right thing; unfortunately, a bug in PHP 4.x makes this fail for UTF-8, which is the encoding all my blogs run in (and, I expect, most WordPress blogs do too)
  • then I was hoping to re-use the existing HTML entities table available in PHP through get_html_translation_table(HTML_ENTITIES), but here again, that table doesn’t have all the named entities defined in HTML! A quick count shows that it knows 107 entities, whereas HTML defines 253 of them; I really can’t tell why this is so, and haven’t found a relevant bug report on this yet (although someone already reported that not all named entities were in the table)
  • so, as a last resort, I had to build the mapping between named entities and their numeric equivalents myself; it was easy enough to extract it from the HTML DTD itself using: less /usr/share/xml/entities/xhtml/*.ent|grep '^<!ENTITY'|sed -e 's/^<\!ENTITY[ \t]*\([A-Za-z0-9]*\)[ \t]*"&#\([0-9]*\);".*$/"\1"=>\2,/'
  • the rest of the coding was then completely trivial
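
For the curious, the conversion step can be sketched roughly as follows. This is not the actual plugin code: the function names are made up for illustration, the table holds only a handful of the 253 entries that the sed command generates, and hooking into the the_content filter is my assumption about where such a filter could be attached. The closure syntax also needs a much more recent PHP than the 4.x mentioned above.

```php
<?php
// Hypothetical excerpt of the name => code point table; the real table
// would hold every entry generated from the DTD entity files.
function entity_map() {
    return array(
        'nbsp'   => 160,
        'agrave' => 224,
        'eacute' => 233,
        'mdash'  => 8212,
        'hellip' => 8230,
    );
}

// Replace each known &name; with its numeric form &#NNN;.
// Names missing from the table (including amp, lt, gt and quot, which
// XML already understands) are left untouched.
function named_entities_to_numeric($content) {
    $map = entity_map();
    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function ($m) use ($map) {
            return isset($map[$m[1]]) ? '&#' . $map[$m[1]] . ';' : $m[0];
        },
        $content
    );
}

// Assumption: inside WordPress, the function is registered as a filter.
// The guard lets the file run on its own, outside WordPress.
if (function_exists('add_filter')) {
    add_filter('the_content', 'named_entities_to_numeric');
}

echo named_entities_to_numeric('Caf&eacute; &amp; th&eacute;&hellip;'), "\n";
```

Leaving unknown names alone is a deliberate choice in this sketch: it keeps the XML built-ins intact and avoids mangling text that merely looks like an entity.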

The good news about all this is that it also made me discover how easy it is to create plugins for WordPress, so, inspiration permitting, I’ll be able to create those quite quickly from now on…

2 Responses to “WordPress and Named Entities”

  1. dom Says:

    Apparently, a similar patch had already been submitted, as well as a similar plugin… Oh well!

  2. dom Says:

    And WordPress 1.5.1 has integrated that patch, so the problem no longer exists in recent versions of WordPress.

Dominique Hazaël-Massieux (dom@w3.org) is part of the World Wide Web Consortium (W3C) Staff; his interests cover a number of Web technologies, as well as the usage of open source software in a distributed work environment.