On this page I explore some thoughts on requirements for XML editors that will support editing of markup for languages such as Arabic, Hebrew and Farsi, that involve right-to-left and bidirectional text.
As a shorthand for 'Arabic, Hebrew and Farsi, and other similar scripts' I will use the term 'RTL scripts' in this article. All of these scripts exhibit significant amounts of bidirectional behaviour because of the directionality of numbers and embedded text in non-RTL scripts.
A basic requirement is to support the Unicode bidirectional algorithm for ordering inline text. For an introduction to what this involves see 'What you need to know about the bidi algorithm and inline markup'. For detailed information, see the Unicode Standard.
An editor will also have to support correct joining and shaping for Arabic script text.
Mention support for visible joiner and non-joiner characters, and any others, and need for correct display of escaped versions.
I think it would be expected that a document or block declared to be right-to-left would be right-aligned within the enclosing block or page, and vice versa. Such declarations would, I think, depend on markup. This presupposes that the editor can detect markup that declares a document or block to have a particular directionality.
In some situations the bidi algorithm alone is insufficient to produce the desired result, and it is necessary to use Unicode control characters or markup. Note that the W3C recommends markup, although there are situations where the Unicode control characters are also needed (e.g. in attribute values).
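The Unicode control characters include:
LRE (U+202A, LEFT-TO-RIGHT EMBEDDING)
RLE (U+202B, RIGHT-TO-LEFT EMBEDDING)
LRO (U+202D, LEFT-TO-RIGHT OVERRIDE)
RLO (U+202E, RIGHT-TO-LEFT OVERRIDE)
PDF (U+202C, POP DIRECTIONAL FORMATTING)
LRM (U+200E, LEFT-TO-RIGHT MARK)
RLM (U+200F, RIGHT-TO-LEFT MARK)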
The first five should ideally have markup counterparts, and should ideally be used only where markup cannot be (e.g. in attribute values or PCDATA-only elements). The last two are likely to be needed even if markup is used in place of the others.
The W3C recommends the use of dedicated markup to indicate bidi information, rather than attaching CSS rules to arbitrary elements or attributes. People may still want to do the latter, however.
An editor would need to understand the markup it is dealing with in terms of the intended bidirectional effect. This means that it is necessary to provide a means of associating expected behaviour with markup elements.
Current thinking seems to be that dedicated markup would consist of a direction-related attribute that could specify directional embedding (LTR or RTL) and directional overrides (again LTR or RTL). For an example of this, see the XHTML 2.0 Working Draft. Note, however, that the proposed ITS (International Tag Set) Working Group at the W3C will seek to publish a definitive set of tags for people to use.
Whatever markup is supported, the CSS properties in the XHTML 2 WD probably provide useful descriptions of behaviour to associate with markup in the user's format - so that the editor knows how to handle the text.
An XML editor that supports RTL scripts must therefore be capable of displaying text that inherits the behaviour described by markup as well as that described by any embedded Unicode control characters.
Markup in the editor must also be capable of inheriting bidi behaviour expressed higher in the tag hierarchy.
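As a rough illustration of associating behaviour with markup and inheriting it down the tag hierarchy, here is a minimal sketch in Python. The rule table, the attribute name "dir", the default direction and the function names are all assumptions made for the example; a real editor would let the user configure the rules for their own format.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: (attribute name, attribute value) -> base direction.
# A user's format might use different attribute names or values.
DIRECTION_RULES = {("dir", "ltr"): "ltr", ("dir", "rtl"): "rtl"}
DEFAULT_DIRECTION = "ltr"  # assumed default when nothing is declared


def declared_direction(element):
    """Return the direction declared on this element itself, or None."""
    for (attr, value), direction in DIRECTION_RULES.items():
        if element.get(attr) == value:
            return direction
    return None


def resolve_directions(root):
    """Return {element: effective direction}, inheriting from ancestors."""
    resolved = {}

    def walk(element, inherited):
        # A declaration on the element wins; otherwise inherit from the parent.
        direction = declared_direction(element) or inherited
        resolved[element] = direction
        for child in element:
            walk(child, direction)

    walk(root, DEFAULT_DIRECTION)
    return resolved


doc = ET.fromstring('<doc dir="rtl"><p>x<q dir="ltr">y<em>z</em></q></p><p/></doc>')
for element, direction in resolve_directions(doc).items():
    print(element.tag, direction)
```

The sketch covers only base direction; overrides would need a second property in the rule table, along the lines of the CSS properties mentioned above.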
It might be useful for an editor to issue a warning where markup and Unicode characters overlap - especially if they contradict each other.
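One possible shape for such a warning is sketched below in Python. It only detects overlap (embedding or override control characters inside an element that also carries a direction attribute); deciding whether the two actually contradict each other would need more logic. The attribute name "dir" and the function name are assumptions for the example.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Bidi classes of the embedding, override and pop characters.
EMBEDDING_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF"}


def overlap_warnings(root, attr="dir"):
    """List (tag, code point) pairs where directional markup and
    Unicode embedding/override controls occur together."""
    warnings = []
    for element in root.iter():
        if element.get(attr) is None:
            continue  # no directional markup on this element
        # itertext() covers the element's own text and that of its descendants.
        for text in element.itertext():
            for ch in text:
                if unicodedata.bidirectional(ch) in EMBEDDING_CLASSES:
                    warnings.append((element.tag, f"U+{ord(ch):04X}"))
    return warnings


doc = ET.fromstring('<doc><p dir="rtl">abc \u202Adef\u202C</p><p>\u202Bxyz\u202C</p></doc>')
print(overlap_warnings(doc))
```

Note that with nested elements that both carry the attribute, the same character would be reported once per enclosing element.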
One problem associated with the use of Unicode control characters is that they are invisible. It would be very helpful if the user could set a preference that allowed them to see a visual representation of embedded characters. It would also be important to distinguish one character from another.
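A minimal sketch of such a preference, in Python, might replace each control character with a bracketed label so that the characters are both visible and distinguishable from one another. The label format and the function name are assumptions; an editor would more likely draw a glyph or icon than insert text.

```python
# Labels for the seven bidi control characters, keyed by the character itself.
BIDI_LABELS = {
    "\u202A": "LRE",
    "\u202B": "RLE",
    "\u202D": "LRO",
    "\u202E": "RLO",
    "\u202C": "PDF",
    "\u200E": "LRM",
    "\u200F": "RLM",
}


def reveal(text):
    """Return text with each bidi control character shown as [LABEL]."""
    return "".join(
        f"[{BIDI_LABELS[ch]}]" if ch in BIDI_LABELS else ch for ch in text
    )


print(reveal("W3C \u200F(World Wide Web Consortium)"))
```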
The Unicode characters can be represented by escapes, including numeric character references and entities. When escapes are used in bidirectional text, undesirable results can arise.
Suppose you have the following characters, shown as ordered in memory (first to the left, last to the right), where XXX represents Arabic or Hebrew characters, and the overall context is right-to-left:
W3C &rlm;(World Wide Web Consortium) XXX
What you would expect to see displayed in the markup is:
XXX (World Wide Web Consortium)&rlm; W3C
What you would typically see after applying the bidi algorithm, however, is:
XXX (W3C &rlm;(World Wide Web Consortium
The text &rlm; has to be recognised as an RLM.
Note that if the escape appears between text of different directionality, there is the opportunity for an even worse outcome. For example, given the following sequence of characters in memory:
Web-&lrm; XXX
You would expect to see:
XXX Web-&lrm;
But would actually get:
XXX ;Web-&lrm
And an &lrm; in the middle of two RTL directional runs would look like:
XXX ;lrm& XXX
The text that makes up the entity, or the numeric character reference, must therefore be treated as a single, indivisible unit that produces the expected effect of a Unicode character on the surrounding text.
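To make the idea of an indivisible unit concrete, the following Python sketch splits a string into units and gives each one a bidi class. An escape becomes a single unit carrying the class of the character it stands for, rather than the classes of its individual characters (where the ampersand and semicolon are neutrals and the letters are strongly left-to-right). The function name is an assumption, and `html.unescape` with its HTML entity table is used here only as a stand-in for whatever entity definitions the document's own DTD provides.

```python
import re
import unicodedata
from html import unescape

# Matches named entities and decimal or hexadecimal numeric character references.
ESCAPE = re.compile(r"&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def bidi_units(text):
    """Split text into (unit, bidi_class) pairs, one per character or escape."""
    units = []
    pos = 0
    for match in ESCAPE.finditer(text):
        # Ordinary characters before the escape keep their own bidi class.
        units.extend((ch, unicodedata.bidirectional(ch)) for ch in text[pos:match.start()])
        resolved = unescape(match.group())
        if len(resolved) == 1:
            # The whole escape is one unit with the class of the character it represents.
            units.append((match.group(), unicodedata.bidirectional(resolved)))
        else:
            # Unknown (or multi-character) entity: fall back to plain characters.
            units.extend((ch, unicodedata.bidirectional(ch)) for ch in match.group())
        pos = match.end()
    units.extend((ch, unicodedata.bidirectional(ch)) for ch in text[pos:])
    return units


for unit, bidi_class in bidi_units("W3C &rlm;(World"):
    print(repr(unit), bidi_class)
```

Feeding these units, rather than raw characters, to the bidi algorithm is what would let a code view display the escape in the position the real character would occupy.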
This is perhaps easier to achieve if you are using 'tags-on-display' mode, where the entity is represented as an object rather than as text. But one would almost certainly also have to handle the display of pure code views too.
Note also that, since an entity intended to produce this effect could have any name, there needs to exist some mechanism to associate the entity with the Unicode control character so that the editor understands what the expected behaviour is.
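One conceivable mechanism is to read the entity declarations themselves and note which names expand to a bidi control character. The Python sketch below does this with a regular expression over DTD text; it is an illustration under simplifying assumptions (general entities only, each value a literal character or a single character reference), not a full DTD parser, and the function and entity names are made up for the example.

```python
import re
import unicodedata

# General entity declarations of the form <!ENTITY name "value">.
ENTITY_DECL = re.compile(r'<!ENTITY\s+([^\s%]+)\s+(["\'])(.*?)\2\s*>')
# A value consisting of exactly one numeric character reference.
CHAR_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")
BIDI_CONTROLS = "\u200E\u200F\u202A\u202B\u202C\u202D\u202E"


def bidi_entities(dtd_text):
    """Map entity names to the Unicode name of the bidi control they expand to."""
    found = {}
    for name, _quote, value in ENTITY_DECL.findall(dtd_text):
        ref = CHAR_REF.fullmatch(value)
        if ref:
            # Decode the hexadecimal or decimal reference to a character.
            value = chr(int(ref.group(1), 16) if ref.group(1) else int(ref.group(2)))
        if len(value) == 1 and value in BIDI_CONTROLS:
            found[name] = unicodedata.name(value)
    return found


dtd = '''<!ENTITY rtlmark "&#x200F;">
<!ENTITY ltrmark '&#8206;'>
<!ENTITY copy "&#169;">'''
print(bidi_entities(dtd))
```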
Extrapolate this to other types of escape.
Go through similar scenario for tags. Start and end tag may need to be switched visually. Attributes may need to be displayed to the left of the tag name.
Consider any requirements for attribute values.
Consider any requirements for tags in RTL scripts.
The i18n test pages may be useful for assessing further requirements and testing behaviour. May need adaptation.
Also, consider display of file names in window title bar. Display in pop ups, etc.
Content created 3 November, 2004. Last update 2004-11-03 16:21 GMT
Copyright © 2004 Richard Ishida, All Rights Reserved.