Don’t call me DOM

22 September 2004

XHTMLizer on steroids

One of the coolest things with XHTML is that it is an XML language, so you can apply any kind of XML tools to it.

One of the terrible things with XHTML (and XML more generally) is how hard it sometimes is to get it right.

One of the depressing things with building XML-based tools for Web technologies is that most of the content out there is in HTML (or the tag soup that some people call by that name), or in ill-formed XHTML.

For quite some time, I have been using our tidy on-line service to get proper XHTML from any kind of HTML/XHTML input. As good as it was, it still didn’t guarantee that the output would be well-formed XML. For instance, if the input document contained characters outside the range that XML accepts, the underlying software (tidy) would leave them as is in the output, and any XML-compliant tool would then refuse to process it.
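To make the character problem concrete, here is a minimal Python sketch that spots the characters XML 1.0 does not accept, based on the `Char` production of the XML 1.0 specification. It is not what tidy or the on-line service does; the function names `find_illegal_xml_chars` and `strip_illegal_xml_chars` are made up for this illustration.

```python
import re

# Anything NOT matched by the XML 1.0 "Char" production:
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_xml_chars(text):
    """Return (offset, character) pairs for characters XML 1.0 rejects.

    Hypothetical helper, named for this example only.
    """
    return [(m.start(), m.group()) for m in _ILLEGAL_XML_CHARS.finditer(text)]


def strip_illegal_xml_chars(text):
    """Return text with those characters dropped (hypothetical helper)."""
    return _ILLEGAL_XML_CHARS.sub("", text)


if __name__ == "__main__":
    # A vertical tab (U+000B) is a typical offender in legacy HTML.
    sample = "<p>ok\x0bbad</p>"
    print(find_illegal_xml_chars(sample))
    print(strip_illegal_xml_chars(sample))
```

A single stray control character like the vertical tab above is enough for a conforming XML parser to reject the whole document.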

That’s where xmllint comes into play: its --recover option ensures that what you get as output is well-formed XML, at the potential cost of dumping part of your XML tree on the floor.
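As a rough sketch of how this can be scripted, the following Python snippet runs xmllint with --recover on a file and returns whatever comes out on standard output. It assumes xmllint is installed and on the PATH; the helper names `recover_command` and `recover_xml` are invented for this example, and this is not how the on-line service itself is wired.

```python
import shutil
import subprocess


def recover_command(path):
    """Build the xmllint command line (hypothetical helper for this example)."""
    return ["xmllint", "--recover", path]


def recover_xml(path):
    """Run `xmllint --recover` on path and return its standard output as bytes.

    Assumes xmllint is available on the PATH.
    """
    if shutil.which("xmllint") is None:
        raise RuntimeError("xmllint not found on PATH")
    # check=False: xmllint may report parse errors through its exit status
    # even when it managed to recover and print a tree.
    result = subprocess.run(
        recover_command(path), capture_output=True, check=False
    )
    return result.stdout
```

Since recovery can silently drop parts of the tree, it is worth comparing the output with the input when the content matters.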

Having added it to tidy on-line, I can now be sure that, using the proper option, I will indeed get well-formed XML as output. A little step for humanity, a cool thing for me.

Dominique Hazaël-Massieux (dom@w3.org) is part of the World Wide Web Consortium (W3C) Staff; his interests cover a number of Web technologies, as well as the usage of open source software in a distributed work environment.