Dochula Pass, Bhutan

Factoids listed at the start of the EURid/UNESCO World Report on IDN Deployment 2013

5.1 million IDN domain names

Only 2% of the world’s domain names are in non-Latin script

The 5 most popular browsers have strong support for IDNs in their latest versions

Poor support for IDNs in mobile devices

92% of the world’s most popular websites do not recognise IDNs as URLs in links

0% of the world’s most popular websites allow IDN email addresses as user accounts

99% correlation between IDN scripts and language of websites (Han, Hangul, Hiragana, Katakana)

About two weeks ago I attended the part of a 3-day Asia Pacific Top Level Domain Association (APTLD) meeting in Oman related to ‘Universal Acceptance’ of Internationalized Domain Names (IDNs), ie. domain names using non-ASCII characters. This refers to the fact that, although IDNs work reasonably well in the browser context, they are problematic when people try to use them in the wider world for things such as email and social media ids, etc. The meeting was facilitated by Don Hollander, GM of APTLD.

Here’s a summary of information from the presentations and discussions.

(By the way, Don Hollander and Dennis Tan Tanaka, Verisign, each gave talks about this during the MultilingualWeb workshop in Madrid the week before. You can find links to their slides from the event program.)

Basic proposition

Internationalized Domain Names (IDNs) provide much improved accessibility to the web for local communities using non-Latin scripts, and are expected to particularly ease entry for the 3 billion people not yet web-enabled. For example, in advertising (such as on the side of a bus) they are easier and much faster to recognise and remember, and they are also easier to note down and type into a browser.

The biggest collection of IDNs is under .com and .net, but there are new Brand TLDs emerging as well as IDN country codes. On the Web there is a near-perfect correlation between use of IDNs and the language of a web site.

The problems tend to arise where IDNs are used across cultural/script boundaries. These cross-cultural boundaries are encountered not just by users but by implementers/companies that create tools, such as email clients, that are deployed across multilingual regions.

It seems to be accepted that there is a case for IDNs, and that they already work pretty well in the context of the browser, but problems in widespread usage of internationalized domain names beyond the browser are delaying demand, and this apparently slow demand doesn’t convince implementers to make changes – it’s a chicken and egg situation.

The main question asked at the meeting was how to break the vicious cycle. The general opinion seemed to lean towards getting major players like Google, Microsoft and Apple to provide end-to-end support for IDNs throughout their product range, to encourage adoption by others.

Problems

Domain names are used beyond the browser context. Problem areas include:

  • email
    • email clients generally don’t support use of non-ASCII email addresses
    • standards don’t address the username part of email addresses as well as they do the domain part
    • there’s an issue to do with SMTPUTF8 not being visible in all the right places
    • you can’t be sure that your email will get through, it may be dropped on the floor even if only one cc is IDN
  • applications that accept email IDs or IDNs
    • even Russian PayPal IDs fail for the .рф domain
    • things to be considered include:
      • plain text detection: you currently need http or www at start in google docs to detect that something is a domain name
      • input validation: no central validation repository of TLDs
      • rendering: what if the user doesn’t have a font?
      • storage & normalization: ids that exist as either IDN or punycode are not unique ids
      • security and spam controls: Google won’t launch a solution without resolving phishing issues; some spam filters or anti-virus scanners think IDNs are dangerous abnormalities
      • other integrations: add contact, create mail and send mail all show different views of IDN email address
  • search: how do you search for IDNs in contacts list?
    • search in general already works pretty well on Google
    • I wasn’t clear about how equivalent IDN and Latin domain names will be treated
  • mobile devices: surprisingly for the APTLD folks, it’s harder to find the needed fonts and input mechanisms to allow typing IDNs in mobile devices
  • consistent rendering:
    • some browsers display as punycode in some circumstances – not very user friendly
    • there are typically differences between full and hybrid (ie. partial) internationalized domain names
    • IDNs typed in twitter are sent as punycode (mouse over the link in the tweet on a twitter page)

Initiatives

Google are working on enabling IDNs throughout their application space, including Gmail but also many other applications – they pulled back from fixing many small, unconnected bugs to develop a company-wide strategy and roll out fixes across all engineering teams. The Microsoft speaker echoed the same concerns and approaches.

In my talk, I expressed the hope that Google and MS and others would collaborate to develop synergies and standards wherever feasible. Microsoft also called for a standard approach, rather than in-house, proprietary solutions, to ensure interoperability.

However, progress is slow because changes need to be made in so many places, not just the email client.

Google expects to have some support for international email addresses this summer. You won’t be able to sign up for Arabic/Chinese/etc email addresses yet, but you will be able to use Gmail to communicate with users on other providers who have internationalized addresses. Full implementation will take a little longer because there’s no real way to test things without raising inappropriate user expectations if the system is live.

SaudiNIC has been running Arabic emails for some time, but it’s a home-grown and closed system – they created their own protocols, because there were no IETF protocols at the time – the addresses are actually converted to punycode for transmission, but displayed as Arabic to the user (http://nic.sa).

Google uses system information about language preferences of the user to determine whether or not to display the IDN rather than punycode in Chrome’s address bar, but this could cause problems for people using a shared computer, for example in an internet café, a conference laptop etc. They are still worrying about users’ reactions if they can’t read/display an email address in non-ASCII script. For email, currently they’re leaning towards just always showing the Unicode version, with the caveat that they will take a hard line on mixed script (other than something mixed with ASCII) where they may just reject the mail.

A trend to note is a growing number of redirects from IDN to ASCII, eg. http://правительство.рф page shows http://government.ru in the address bar when you reach the site.

Other observations

All the Arabic email addresses I saw were shown fully right to left, ie. <tld><domain>@<username>. I wonder whether this may dislodge some of the hesitation in the IETF about the direction in which web addresses should be displayed – perhaps they should therefore also flow right-to-left?? (especially if people write domain names without http://, which these guys seem to think they will).

Many of the people in the room wanted to dispense with the http:// for display of web addresses, to eliminate the ASCII altogether, and also to get rid of www. The problem is how to identify the string as a domain name – is the dot sufficient? We saw some examples of this, but they had something like “see this link” alongside.

By the way, Google is exploring the idea of showing the user, by default, only the domain name of a URL in future versions of the Chrome browser address bar. A Google employee at the workshop said “I think URLs are going away as far as something to be displayed to users – the only thing that matters is the domain name … users don’t understand the rest of the URL”. I personally don’t agree with this.

One participant proposed that government mandates could be very helpful in encouraging adaptation of technologies to support international domain names.

My comments

I gave a talk and was on a panel. Basically my message was:

Most of the technical developments for IDN and IRIs were developed at the IETF and the Unicode Consortium, but with significant support by people involved in the W3C Internationalization Working Group. Although the W3C hasn’t been leading this work, it is interested in understanding the issues and providing support where appropriate. We are, however, also interested in wider issues surrounding the full path name of the URL (not just the domain name), 3rd level domain labels, frag ids, IRI vs punycode for domain name escaping, etc. We also view domain names as general resource identifiers (eg. for use in linked data), not just for a web presence and marketing.

I passed on a message that groups such as the Wikimedia folks I met with in Madrid the week before are developing a very wide range of fonts and input mechanisms that may help users input non-Latin IDs on terminals, mobile devices and such like, especially when travelling abroad. It’s something to look into. (For more information about Wikimedia’s jQuery extensions, see here and here.)

I mentioned the bidi issues related to both the overall direction of Arabic/Hebrew/etc URLs/domain names, and the more difficult question of how to handle mixed direction text that can make the logical http://www.oman/muscat render to the user as http://www.muscat/oman when ‘muscat’ and ‘oman’ are in Arabic, due to the default properties of the Unicode bidi algorithm. Community guidance would be a help in resolving this issue.

I said that the W3C is all about getting people together to find interoperable solutions via consensus, and that we could help with networking to bring the right people together. I’m not proposing that we should take on ownership of the general problem of Universal Acceptance, but I did suggest that if they can develop specific objectives for a given aspect of the problem, and identify a natural community of stakeholders for that issue, then they could use our Community Groups to give some structure to and facilitate discussions.

I also suggested that we all engage in grass-roots lobbying, requesting that service/tool providers allow us to use IDNs.

Conclusions

At the end of the first day, Don Hollander summed up what he had gathered from the presentations and discussions as follows:

People want IDNs to work, they are out there, and they are not going away. Things don’t appear quite so dire as he had previously thought, given that browser support is generally good, closed email communities are developing, and search and indexing works reasonably well. Also Google and Microsoft are working on it, albeit perhaps slower than people would like (but that’s because of the complexity involved). There are, however, still issues.

The question is how to go forward from here. He asked whether APTLD should coordinate all communities at a high level with a global alliance. After comments from panelists and participants, he concluded that APTLD should hold regular meetings to assess and monitor the situation, but should focus on advocacy. The objective would be to raise visibility of the issues and solutions. “The greatest contribution from Google and Microsoft may be to raise the awareness of their thousands of geeks.” ICANN offered to play a facilitation role and to generate more publicity.

One participant warned that we need a platform for forward motion, rather than just endless talking. I also said that in my panel contributions. I was a little disappointed (though not particularly surprised) that APTLD didn’t try to grasp the nettle and set up subcommittees to bring players together to take practical steps to address interoperable solutions, but hopefully the advocacy will help move things forward and developments by companies such as Google and Microsoft will help start a ball rolling that will eventually break the deadlock.

Picture of the page in action.

This kind of list could be used to set font-family styles for CSS, if you want to be reasonably sure what the user will see, or it could be used just to find a font you like for a particular script.

I’ve updated the page to show the fonts added in Windows 8. This is the list:

  • Aldhabi (Urdu Nastaliq)
  • Urdu Typesetting (Urdu Nastaliq)
  • Gadugi (Cherokee/Unified Canadian Aboriginal Syllabics)
  • Myanmar Text (Myanmar)
  • Nirmala UI (10 Indic scripts)

There were also two additional UI fonts for Chinese, Jhenghei UI (Traditional) and Yahei UI (Simplified), which I haven’t listed. Also Microsoft Uighur acquired a bold font.
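As a quick illustration of how this kind of list gets used in CSS (a sketch only: Nirmala UI is from the Windows 8 list above; Mangal, an older Windows Devanagari font, is added here as an assumed fallback):

.devanagari {
  font-family: "Nirmala UI", Mangal, sans-serif;
  }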

>> See the list

See the blog post for the first version or the page for more information.

Update, 25 Jan 2013

Patrick Andries pointed out that Tifinagh using the Windows Ebrima font was missing from the list. Not any more.

Characters in the Unicode Balinese block.

I just uploaded an initial draft of an article Balinese Script Notes. It lists the Unicode characters used to represent Balinese text, and briefly describes their use. It starts with brief notes on general script features and discussions about which Unicode characters are most appropriate when there is a choice.

The script type is abugida – consonants carry an inherent vowel. It’s a complex script derived from Brahmi, and has lots of contextual shaping and positioning going on. Text runs left-to-right, and words are not separated by spaces.

I think it’s one of the most attractive scripts in Unicode, and for that reason I’ve been wanting to learn more about it for some time now.

>> Read it


A translate attribute was recently added to HTML5. At the three MultilingualWeb workshops we have run over the past two years, the idea of this kind of ‘translate flag’ has constantly excited strong interest from localizers, content creators, and from folks working with language technology.

How it works

Typically authors or automated script environments will put the attribute in the markup of a page. You may also find that, in industrial translation scenarios, localizers may add attributes during the translation preparation stage, as a way of avoiding the multiplicative effects of dealing with mistranslations in a large number of languages.

There is no effect on the rendered page (although you could, of course, style it if you found a good reason for doing so). The attribute will typically be used by workflow tools when the time comes to translate the text – be it by the careful craft of human translators, or by quick gist-translation APIs and services in the cloud.

The attribute can appear on any element, and it takes just two values: yes or no. If the value is no, translation tools should protect the text of the element from translation. The translation tool in question could be an automated translation engine, like those used in the online services offered by Google and Microsoft. Or it could be a human translator’s ‘workbench’ tool, which would prevent the translator inadvertently changing the text.

Setting this translate flag on an element applies the value to all contained elements and to all attribute values of those elements.

You don’t have to use translate="yes" for this to work. If a page has no translate attribute, a translation system or translator should assume that all the text is to be translated. The yes value is likely to see little use, though it could be very useful if you need to override a translate flag on a parent element and indicate some bits of text that should be translated. You may want to translate the natural language text in examples of source code, for example, but leave the code untranslated.
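For example, something like this (a made-up fragment, just to illustrate the override):

<pre translate="no">
printf("<span translate="yes">Hello world</span>\n");
</pre>

The pre block as a whole is protected, but the message text inside the span is still offered for translation.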

Why it is needed

You come across a need for this quite frequently. There is an example in the HTML5 spec about the Bee Game. Here is a similar, but real example from my days at Xerox, where the documentation being translated referred to a machine with text on the hardware that wasn’t translated.

<p>Click the Resume button on the Status Display or the
<span class="panelmsg" translate="no">CONTINUE</span> button
on the printer panel.</p>

Here are a couple more (real) examples of content that could benefit from the translate attribute. The first is from a book, quoting a title of a work.

<p>The question in the title <cite translate="no">How Far Can You Go?</cite> applies to both the undermining of traditional religious belief by radical theology and the undermining of literary convention by the device of "breaking frame"...</p>

The next example is from a page about French bread – the French for bread is ‘pain’.

<p>Welcome to <strong translate="no">french pain</strong> on Facebook. Join now to write reviews and connect with <strong translate="no">french pain</strong>. Help your friends discover great places to visit by recommending <strong translate="no">french pain</strong>.</p>

So adding the translate attribute to your page can help readers better understand your content when they run it through automatic translation systems, and can save a significant amount of cost and hassle for translation vendors with large throughput in many languages.

What about Google Translate and Microsoft Translator?

Both the Google and Microsoft online translation services already provide the ability to prevent translation of content by adding markup to your content, although they do it in (multiple) different ways. Hopefully, the new attribute will help significantly by providing a standard approach.

Both Google and Microsoft currently support class="notranslate", but replacing a class attribute value with an attribute that is a formal part of the language makes this feature much more reliable, especially in wider contexts. For example, a translation prep tool would be able to rely on the meaning of the HTML5 translate attribute always being what is expected. Also it becomes easier to port the concept to other scenarios, such as other translation APIs or localization standards such as XLIFF.

As it happens, the online service of Microsoft (who actually proposed a translate flag for HTML5 some time ago) already supported translate="no". This, of course, was a proprietary tag until now, and Google didn’t support it. However, just yesterday morning I received word, by coincidence, that Webkit/Chromium has just added support for the translate attribute, and yesterday afternoon Google added support for translate="no" to its online translation service. See the results of some tests I put together this morning. (Neither yet supports the translate="yes" override.)

In these proprietary systems, however, there are a good number of other non-standard ways to express similar ideas, even just sticking with Google and Microsoft.

Microsoft apparently supports style="notranslate". This is not one of the options Google lists for their online service, but on the other hand Google has things that are not available via Microsoft’s service.

For example, if you have an entire page that should not be translated, you can add <meta name="google" value="notranslate"> inside the head element of your page and Google won’t translate any of the content on that page. (However they also support <meta name="google" content="notranslate">.) This shouldn’t be Google specific, and a single way of doing this, ie. translate="no" on the html tag, is far cleaner.
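In other words, flagging a whole page would then just look something like this (sketch):

<html lang="en" translate="no">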

It’s also not made clear, by the way, when dealing with either translation service, how to make sub-elements translatable inside an element where translate has been set to no – which may sometimes be needed.

As already mentioned, the new HTML5 translate attribute provides a simple and standard feature of HTML that can replace and simplify all these different approaches, and will help authors develop content that will work with other systems too.

Can’t we just use the lang attribute?

It was inevitable that someone would suggest this during the discussions around how to implement a translate flag. However, overloading language tags is not the solution. For example, a language tag can indicate which text is to be spellchecked against a particular dictionary. This has nothing to do with whether that text is to be translated or not. They are different concepts. In a document that has lang="en" on the html tag, if you set lang="notranslate" lower down the page, that text will no longer be spellchecked, since the language is no longer English. (Nor, for that matter, will styling work, voice browsers pronounce correctly, etc.)

Going beyond the translate attribute

The W3C’s ITS (Internationalization Tag Set) Recommendation proposes the use of a translate flag such as the attribute just added to HTML5, but it also goes beyond that in describing a way to assign translate flag values to particular elements or combinations of markup throughout a document or set of documents. For example, you could say, if it makes sense for your content, that by default all p elements with a particular class name should have the translate flag set to no for a specific set of documents.
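For instance, a rule along these lines (a sketch using ITS 1.0 markup; the class name legalnotice is made up) could switch off translation for all such paragraphs across a set of documents:

<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its">
  <its:translateRule selector="//p[@class='legalnotice']" translate="no"/>
</its:rules>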

Microsoft offers something along these lines already, although it is much less powerful than the ITS approach. If you use <meta name="microsoft" content="notranslateclasses myclass1 myclass2" /> anywhere on the page (or as part of a widget snippet) it ensures that any of the CSS classes listed following “notranslateclasses” should behave the same as the “notranslate” class.

Microsoft and Google’s translation engines also don’t translate content within code elements. Note, however, that you don’t seem to have any choice about this – there don’t seem to be instructions about how to override this if you do want your code element content translated.

By the way, there are plans afoot to set up a new MultilingualWeb-LT Working Group at the W3C in conjunction with a European Commission project to further develop ideas around the ITS spec, and create reference implementations. They will be looking, amongst many other things, at ways of integrating the new translate attribute into localization industry workflows and standards. Keep an eye out for it.

Picture of the page in action.

I’ve wanted to get around to this for years now. Here is a list of fonts that come with Windows 7 and Mac OS X Snow Leopard/Lion, grouped by script.

This kind of list could be used to set font-family styles for CSS, if you want to be reasonably sure what the user will see, or it could be used just to find a font you like for a particular script. I’m still working on the list, and there are some caveats.

>> See the list

Some of the fonts listed above may be disabled on the user’s system. I’m making an assumption that someone who reads Tibetan will have the Tibetan font turned on, but for my articles that explain writing systems to people in English, such assumptions may not hold.

The list I used to identify Windows fonts is Windows 7-specific and fairly stable, but the Mac font list spans more than one version of Mac OS X: I could only find an unofficial list of fonts for Snow Leopard, and there were some fonts on that list that I didn’t have on my system. Where a Mac font is new with Lion (and there are a significant number) it is indicated. See the official list of fonts on Mac OS X Lion.

There shouldn’t be any fonts listed here for a given script that aren’t supplied with Windows 7 or Mac OS X Snow Leopard/Lion, but there are probably supplied fonts that are not yet listed here (typically these will be large fonts that cover multiple scripts). In particular, note that I haven’t yet made a list of fonts that support Latin, Greek and Cyrillic (mainly because there are so many of them and partly because I’m wondering how useful it will be).

The text used is as much as would fit on one line of article 1 of the Universal Declaration of Human Rights, taken from this Unicode page, wherever I could find it. I created a few instances myself, where it was missing, and occasionally I resorted to arbitrary lists of characters.

You can obtain a character-based version of the text used by looking at the source text: look for the title attribute on the section heading.

Things still to do:

  • create sections for Latin, Greek and Cyrillic fonts
  • check for fonts covering multiple Unicode blocks
  • figure out how to tell, and how to show which is the system default
  • work out and show what’s not available in Windows XP
  • work out what’s new in Lion, and whether it’s worth including them
  • figure out whether people with different locale setups see different things
  • recapture all font images that need it at 36px, rather than varying sizes

Update, 19 Feb 2012

I uploaded a new version of the font list with the following main changes:

  • If you click on an image you see text with that font applied (if you have it on your system, of course). The text can be zoomed from 14px to 100px (using a nice HTML5 slider, if you have the right browser! [try Chrome, Safari or Opera]). This text includes a little Latin text so you can see the relationship between that and the script.
  • All font graphics are now standardised so that text is imaged at a font size of 36px. This makes it more difficult to see some fonts (unless you can use the zoom text feature), but gives a better idea of how fonts vary in default size.
  • I added a few extra fonts which contained multiple script support.
  • I split Chinese into Simplified and Traditional sections.
  • Various other improvements, such as adding real text for N’Ko, correcting the Traditional Chinese text, flipping headers to the left for RTL fonts, reordering fonts so that similar ones are near to each other, etc.

These are notes on using CSS @font-face to gain finer control over the fonts applied to characters in particular Unicode ranges of your text, without resorting to additional markup. Possibilities and problems.

Changing the font used for certain characters

Most non-English fonts mix glyphs from different writing systems. Usually the font contains glyphs for Latin characters plus a non-Latin script, for example English+Japanese, or English+Thai, etc.

Normally the font designer will take care to harmonise the Latin script glyphs with the non-Latin, but there may be cases where you want to change the design of the glyphs for, say, an embedded script without adding markup to your page.

For example, if I apply the MS-Mincho font to some content in Japanese with embedded Latin text I’m likely to see the following:

Let’s suppose I’d like the English text to appear in a different, proportionally-spaced font. I could put markup around the English and set a class on the markup to apply the font I want, but this is very time consuming and bloats your code.

An alternative is to use @font-face. Here is an example:

@font-face { 
  font-family: myJapanesefont;
  src: local(MS-Mincho);
  }
@font-face {
  font-family: myJapanesefont;
  src: local(Gentium);
  unicode-range: U+41-5A, U+61-7A, U+C0-FF;
  }
p { 
  font-family: myJapanesefont; 
  }

Note: When specifying src the local() keyword indicates that font-face should look for the font on the user’s system. Of course, to improve interoperability, you may want to specify a number of alternatives here, or a downloadable WOFF font. The most interoperable value to use for local() is the Postscript name of the font. (On the Mac open Font Book, select the font, and choose Preview > Show Font Information to find this.)

The result would be:

The first font-face declaration associates the MS-Mincho font with the name ‘myJapanesefont’. The second font-face declaration associates the Gentium font with the Unicode code points in the Latin-1 letter range (of course, you can extend this if you use Latin characters outside that range and they are covered by the font).

Note how I was careful to set the unicode-range values to exclude punctuation (such as the exclamation mark) that would be used by (and harmonised with) the Japanese characters.
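If you also want a downloadable fallback rather than relying only on locally installed fonts, the src line in the second rule could be extended along these lines (a sketch; the WOFF URL is just a placeholder for whatever you host):

@font-face {
  font-family: myJapanesefont;
  src: local(Gentium), url(fonts/gentium.woff) format("woff");
  unicode-range: U+41-5A, U+61-7A, U+C0-FF;
  }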

Adding support for new characters to a font

You can use the same approach for fonts that don’t have support for a particular Unicode range.

For example, the Nafees Nastaliq font has no glyphs for the Latin range (other than digits), so the browser falls back to the system default.

With the following code, I can produce a more pleasant font for the ‘W3C’ part:

@font-face { 
  font-family: myUrduFont;
  src: local(NafeesNastaleeq);
  }
@font-face {
  font-family: myUrduFont;
  src: local(BookAntiqua);
  unicode-range: U+30-FF;
  }
div p { 
  font-family: myUrduFont; 
  font-size: 60px;
  }

A big fly in the ointment

If you look at the ranges in the unicode-range value, you’ll see that I kept to just the letters of the alphabet in the Japanese example, and the missing glyphs in the Urdu case.

There are a number of characters that are used by all scripts, however, and these cause problems because you can’t apply fonts based on the context – even if you could work out what that context was.

In the case of the Japanese example above, numbers are left to be rendered by the Mincho font, but when those characters appear in the Latin text, they look incorrectly sized. Look, for example, at the 3 in W3C below.

The same problem arises with spaces and punctuation marks. The exclamation mark was left in the Mincho font in the Japanese example because, in this case, it is part of the Japanese text. Punctuation of this kind could, however, be associated with the Latin text. See the question mark in this example.

Even more problematic are the spaces in that example. They are too wide in the Latin text. In Urdu text we have the opposite problem: use Urdu space glyphs in Latin text and you don’t see them at all (there should be a gap between W3C and i18n below).

With my W3C hat on, I start wondering whether there are any rules we can use to apply different glyphs for some characters depending on the script context in which they are used, but then I realise that this is going to bring in all the problems we already have for bidi text when dealing with punctuation or spaces between flows of text in different scripts. I’m not sure it’s a tractable problem without resorting to markup to delimit the boundaries. But then, of course, we end up right back where we started.

So it seems, disappointingly, that the unicode-range property is destined to be of only limited usefulness for me. That’s a real shame.

Another small issue

The examples don’t show major problems, but I assume that sometimes the fonts you want to bring together using font-face will have very different aspect ratios, so you may need to use something like font-size-adjust to balance the size of the fonts being used.
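A minimal sketch of what that might look like (the 0.5 value is purely illustrative; you would tune it to the fonts in question):

p {
  font-family: myJapanesefont;
  font-size-adjust: 0.5;
  }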

Browser support

The CSS code above worked for me in Chrome and Safari on Mac OS X 10.6, but didn’t work in Firefox or Opera. Nor did it work in IE9 on Windows 7.

There appears to be some confusion about XHTML1.0 vs XHTML5. Here is my best shot at an explanation of what XHTML5 is.

* This post is written for people with some background in MIME types and html/xml formats. In case that’s not you, this may give you enough to follow the idea: ‘served as’ means sent from a server to the browser with a MIME type declaration in the HTTP protocol header that says that the content of the page is HTML (text/html) or XML (eg. application/xhtml+xml). See examples and more explanations.

XHTML5 is an HTML5 document served as* application/xhtml+xml (or another XML mime type). The syntax rules for XHTML5 documents are simply those rules given by the XML specification. The vocabulary (elements and attributes) is defined by the HTML5 spec.

Anything served as text/html is not XHTML5.

Note that HTML5 (without the X) can be written in a style that looks like XML syntax. For example, using a / in empty elements (eg. <img src="..." />), or using quotes around attributes. But code written this way is still HTML5, not XHTML5, if it is served as text/html.

There are normally other differences between HTML5 and XHTML5. For example, XHTML5 documents may have an XML declaration at the start of the document. HTML5 documents cannot have that. XHTML5 documents are likely to have a more complicated doctype (to facilitate XML processing). And XHTML5 documents will have an xmlns attribute on the html tag. There are a few other HTML5 features that are not compatible with XML, and must be avoided.
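Here is a minimal sketch of what an XHTML5 document might look like (remember that it only counts as XHTML5 if it is actually served with an XML MIME type):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head><title>Example</title></head>
  <body><p>Served as application/xhtml+xml, this is XHTML5.</p></body>
</html>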

Similar differences existed between HTML 4.01 and XHTML 1.0. However, moving on from XHTML 1.0 will typically involve a subtle but significant shift in thinking. You might have written XHTML 1.0 with no intention of serving it as anything other than text/html. XHTML in the XHTML 1.0 sense tended to be seen largely as a difference in syntax; it was originally designed to be served as XML, but (with some customisations to suit HTML documents) could be, and usually was, served with an HTML mime type. XHTML in the XHTML5 sense, means HTML5 documents served with an XML mime type (and appropriate customisations to suit XML documents), ie. it’s the MIME type, not the syntax, that makes it XHTML.

Which brings us to polyglot documents. A polyglot document is one written in the common subset of HTML5 and XML, so that it can be processed as either HTML or XHTML, and can be served as either text/html or application/xhtml+xml, ie. as either HTML5 or XHTML5, without any errors or warnings in either case. The polyglot spec defines the things which allow this compatibility (such as using no XML declaration, proper casing of element names, etc.), and which things to avoid. It also mandates at least one additional constraint, ie. disallowing UTF-16 encoded documents.

One of the more useful features of UniView is its ability to list the characters in a string with names and codepoints. This is particularly useful when you can’t tell what a string of characters contains because you don’t have a font, or because the script is too complex, etc.

'ishida' in Persian in a nastaliq font style

For example, I was recently sent an email where my name was written in Persian as ایشی‌دا. The image shows how it looks in a nastaliq font.

To see the component characters, drop the string into UniView’s Copy & Paste field and click on the downwards pointing arrow icon. Here is the result:

list of characters

Note how you can now see that there’s an invisible control character in the string. Note also that you see a graphic image for each character, which is a big help if the string you are investigating is just a sequence of boxes on your system.

Not only can you discover characters in this way, but you can create lists of characters which can be pasted into another document, and customise the format of those lists.

Pasting the list elsewhere

If you select this list and paste it into a document, you’ll see something like this:

  0627  ARABIC LETTER ALEF
  06CC  ARABIC LETTER FARSI YEH
  0634  ARABIC LETTER SHEEN
  06CC  ARABIC LETTER FARSI YEH
  200C  ZERO WIDTH NON-JOINER
  062F  ARABIC LETTER DAL
  0627  ARABIC LETTER ALEF

You can make the characters appear by deselecting Use graphics on the Look up tab. (Of course, you need an Arabic font to see the list as intended.)

ا  ‎0627  ARABIC LETTER ALEF
ی  ‎06CC  ARABIC LETTER FARSI YEH
ش  ‎0634  ARABIC LETTER SHEEN
ی  ‎06CC  ARABIC LETTER FARSI YEH
‌  ‎200C  ZERO WIDTH NON-JOINER
د  ‎062F  ARABIC LETTER DAL
ا  ‎0627  ARABIC LETTER ALEF

Customising the list format

What may be less obvious is that you can also customise the format of this list using the settings under the Options tab. For example, using the List format settings, I can produce a list that moves the character column between the number and the name, like this:

  0627  ا  ARABIC LETTER ALEF
  ‎06CC  ی  ARABIC LETTER FARSI YEH
  ‎0634  ش  ARABIC LETTER SHEEN
  ‎06CC  ی  ARABIC LETTER FARSI YEH
  ‎200C  ‌  ZERO WIDTH NON-JOINER
  ‎062F  د  ARABIC LETTER DAL
  ‎0627  ا  ARABIC LETTER ALEF

Or I can remove one or more columns from the list, such as:

  ا  ARABIC LETTER ALEF
  ی  ARABIC LETTER FARSI YEH
  ش  ARABIC LETTER SHEEN
  ی  ARABIC LETTER FARSI YEH
  ‌  ZERO WIDTH NON-JOINER
  د  ARABIC LETTER DAL
  ا  ARABIC LETTER ALEF

With the option Show U+ in lists I can also add or remove the U+ before the codepoint value. For example, this lets me produce the following list:

  ‎U+0627  ARABIC LETTER ALEF
  ‎U+06CC  ARABIC LETTER FARSI YEH
  ‎U+0634  ARABIC LETTER SHEEN
  ‎U+06CC  ARABIC LETTER FARSI YEH
  ‎U+200C  ZERO WIDTH NON-JOINER
  ‎U+062F  ARABIC LETTER DAL
  ‎U+0627  ARABIC LETTER ALEF

Other lists in UniView

We’ve shown how you can make a list of characters in the Copy & Paste box, but don’t forget that you can create lists for a Unicode block, custom range of characters, search list results, or list of codepoint values, etc. And not only that, but you can filter lists in various ways.

Here is just one quick example of how you can obtain a list of numbers for the Devanagari script.

  1. On the Look up tab, select Devanagari from the Unicode block pull down list.
  2. Select Show range as list and deselect (optional) Use graphics.
  3. Under the Filter tab, select Number from the Show properties pull down list.
  4. Click on Make list from highlights

You end up with the following list, that you can paste into your document.

०  ‎0966  DEVANAGARI DIGIT ZERO
१  ‎0967  DEVANAGARI DIGIT ONE
२  ‎0968  DEVANAGARI DIGIT TWO
३  ‎0969  DEVANAGARI DIGIT THREE
४  ‎096A  DEVANAGARI DIGIT FOUR
५  ‎096B  DEVANAGARI DIGIT FIVE
६  ‎096C  DEVANAGARI DIGIT SIX
७  ‎096D  DEVANAGARI DIGIT SEVEN
८  ‎096E  DEVANAGARI DIGIT EIGHT
९  ‎096F  DEVANAGARI DIGIT NINE

(Of course, you can also customise the layout of this list as described in the previous section.)

Try it out.

Reversing the process: from list to string

To complete the circle, you can also cut & paste any of the lists in the blog text above into UniView, to explore each character’s properties or recreate the string.

Select one of the lists above and paste it into the Characters input field on the Look up tab. Hit the downwards pointing arrow icon alongside, and UniView will recreate the list for you. Click on each character to view detailed information about it.

If you want to recreate the string from the list, simply click on the upwards pointing arrow icon below the Copy & paste box, and the list of characters will be reconstituted in the box as a string.

Voila!

I created a new HTML5-based template for our W3C Internationalization articles recently, and I’ve just received some requests to translate documents into Arabic and Hebrew, so I had to get around to updating the bidi style sheets. (To make it quicker to develop styles, I create the style sheet for ltr pages first, and only when that is working well do I create the rtl style sheet info.)

Here are some thoughts about how to deal with style sheets for both right-to-left (rtl) and left-to-right (ltr) documents.

What needs changing?

Converting a style sheet is a little more involved than using a global search and replace to convert left to right, and vice versa. While this may catch many of the things that need changing, it won’t catch all, and it could also introduce errors into the style sheet.

For example, I had selectors called .topleft and .bottomright in my style sheet. These, of course, shouldn’t be changed. There may also be occasional situations where you don’t want to change the direction of a particular block.

Another thing to look out for: I tend to use -left and -right a lot when setting things like margins, but where I have set something like margin: 1em 32% .5em 7.5%; you can’t just use search and replace, and you have to carefully scour the whole of the main stylesheet to find the instances where the right and left margins are not balanced.

There is a web service called CSSJanus that can apply a little intelligence to convert most of what you need. You still have to use it with care, but it does come with a convention to prevent conversion of properties where needed (you can disable CSSJanus from running on an entire class or any rule within a class by prepending a /* @noflip */ comment before the rule(s) you want CSSJanus to ignore).
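For example, to stop CSSJanus from flipping one of those rules, you would write something like this (a sketch based on the convention just described):

/* @noflip */
.topleft {
  left: 0;
  top: 0;
  }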

Note also that there are other things that may need changing besides the right and left values. For example, some of the graphics on our template need to be flipped (such as the dog-ear icon in the top corner of the page).

CSS may provide a way to do this in the future, but it is still only a proposal in a First Public Working Draft at the moment. (It would involve writing a selector such as #site-navigation:dir(rtl) { background-image: url(standards-corner-rtl.png); }.)

Approach 1: extracting changed properties to an auxiliary style sheet

For the old template I have a secondary, bidi style sheet that I load after the main style sheet. This bidi style sheet contains a copy of just the rules in the main style sheet that needed changing and overwrites the styles in the main style sheet. These changes were mainly to margin, padding, and text-align properties, though there were also some others, such as positioning, background and border properties.

The cons of this approach were:

  1. it’s a pain to create and maintain a second style sheet in the first place
  2. it’s an even bigger pain to remember to copy any relevant changes in the main style sheet to the bidi style sheet, not least because the structure is different, and it’s a little harder to locate things
  3. everywhere that the main style sheet declared, say, a left margin without declaring a value for the right margin, you have to figure out what that other margin should be and add it to the bidi style sheet (see the sketch after this list). For example, if a figure has just margin-left: 32%, that will be converted to margin-right: 32%, but because the bidi style sheet hasn’t overwritten the main style sheet’s margin-left value, the Arabic page will end up with both margins set to 32%, and a much thinner figure than desired. To prevent this, you need to figure out what all those missing values should be, which is typically not straightforward, and add them explicitly to the bidi style sheet.
  4. downloading a second style sheet and overwriting styles leads to higher bandwidth consumption and more processing work for the rtl pages.
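Here is a sketch of the kind of thing item 3 describes (values made up for illustration):

/* main style sheet */
figure { margin-left: 32%; }

/* bidi style sheet – mirroring the property is not enough... */
figure {
  margin-right: 32%;
  margin-left: 0; /* ...the left margin from the main sheet must also be explicitly overridden; 0 is just for illustration */
  }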

Approach 2: copying the whole style sheet and making changes

This is the approach that I’m trying for the moment. Rather than painstakingly picking out just the lines that changed, I take a copy of the whole main style sheet, and load that with the article instead of the main style sheet. Of course, I still have to change all the lefts to rights, and vice versa, and change all the graphics, etc. But I don’t need to add additional rules in places where I previously only specified one side margin, padding, etc.

We’ll see how it works out. Of course, the big problem here is that any change I make to the main style sheet has to be copied to the bidi style sheet, whether it is related to direction or not. Editing in two places is definitely going to be a pain, and breaks the big advantage that style sheets usually give you of applying changes with a single edit. Hopefully, if I’m careful, CSSJanus will ease that pain a little.

Another significant advantage should be that the page loads faster, because you don’t have to download two style sheets and overwrite a good proportion of the main style sheet to display the page.

And finally, as long as I format things exactly the same way, by running a diff program I may be able to spot where I forgot to change things in a way that’s not possible with approach 1.

Approach 3: using :lang and a single file

On the face of it, this seems like a better approach. Basically you have a single style sheet, but when you have a pair of rules such as p { margin-right: 32%; margin-left: 7.5%;} you add another line that says p:lang(ar) { margin-left: 32%; margin-right: 7.5%; }.

For small style sheets, this would probably work fine, but in my case I see some cons with this approach, which is why I didn’t take it:

  1. there are so many places where these extra lines need to be added that it will make the style sheet much harder to read, and this is made worse because the p:lang(ar) in the example above would actually need to be p:lang(ar), p:lang(he), p:lang(ur), p:lang(fa), p:lang(dv) ..., which is getting very messy, but also significantly pumps up the bandwidth and processing requirements compared with approach 2 (and not only for rtl docs).
  2. you still have to add all those missing values we talked about in approach 1 that were not declared in the part of the style sheet dealing with ltr scripts
  3. the list of languages could be long, since there is no way to say “make this rule work for any language with a predominantly rtl script”, and obscures those rules that really are language specific, such as for font settings, that I’d like to be able to find quickly when maintaining the style sheet
  4. you really need to use the :lang() selector for this, and although it works on all recent versions of major browsers, it doesn’t work on, for example, IE6

Having said that, I may use this approach for the few things that CSSJanus can’t convert, such as flipping images. That will hopefully mean that I can produce the alternative stylesheet in approach 2 just by running through CSSJanus. (We’ll see if I’m right in the long run, but so far so good…)

Approach 4: what I’d really like to do

The cleanest way to reduce most of these problems would be to add some additional properties or values so that if you wanted to you could replace

p { margin-right: 32%; margin-left: 7.5%; text-align: left; }

with

p { margin-start: 32%; margin-end: 7.5%; text-align: start; }

Where start refers to the left for ltr documents and right for rtl docs. (And end is the converse.)

This would mean that that one rule would work for both ltr and rtl pages and I wouldn’t have to worry about most of the above.

The new properties have been strongly recommended to the CSS WG several times over recent years, but have been blocked mainly by people who fear that a proliferation of properties or values is confusing to users. There may be some issues to resolve with regards to the cascade, but I’ve never really understood why it’s so hard to use start and end. Nor have I met any users of RTL scripts (or vertical scripts, for that matter) who find using start and end more confusing than using right and left – in fact, on the contrary, the ones I have talked with are actively pushing for the introduction of start and end to make their life easier. But it seems we are currently still at an impasse.

text-align

Similarly, a start and end value for text-align would be very useful. In fact, such a value is in the CSS3 Text module and is already recognised by latest versions of Firefox, Safari and Chrome, but unfortunately not IE8 or Opera, so I can’t really use it yet.

In my style sheet, due to some bad design on my part, what I actually needed most of the time was a value that says “turn off justify and apply the current default” – ie. align the text to left or right depending on the current direction of the text. Unfortunately, I think that we have to wait for full support of the start and end values to do that. Applying text-align:left to unjustify, say, p elements in a particular context causes problems if some of those p elements are rtl and others ltr. This is because, unlike mirroring margins or padding, text-align is more closely associated with the text itself than with page geometry. (I resolved this by reworking the style sheet so that I don’t need to unjustify elements, but I ought to follow my own advice more in future, and avoid using text-align unless absolutely necessary.)

In the phrase “Zusätzlich erleichtert PLS die Eingrenzung von Anwendungen, indem es Aussprachebelange von anderen Teilen der Anwendung abtrennt.” (“In addition, PLS facilitates the localization of applications by separating pronunciation concerns from other parts of the application.”) there are many long words. To fit these in narrow columns (coming soon to the Web via CSS) or on mobile devices, it would help to automatically hyphenate them.

Other major browsers already supported soft hyphens by the time Firefox 5 added its support. Soft hyphens provide a manual workaround for breaking long words, but more recently browsers such as Firefox, Safari and Chrome have begun to support the CSS3 hyphens property, with hyphenation dictionaries for a range of languages, to support (or disable, if needed) automatic hyphenation. (Note, however, that Aussprachebelange is incorrectly hyphenated in the example from Safari 5.1 on Lion shown above. It is hyphenated as Aussprac- hebelange. Some refinement is clearly still needed at this stage.)

For hyphenation to work correctly, the text must be marked up with language information, using the language tags described earlier. This is because hyphenation rules vary by language, not by script. The description of the hyphens property in CSS says “Correct automatic hyphenation requires a hyphenation resource appropriate to the language of the text being broken. The UA is therefore only required to automatically hyphenate text for which the author has declared a language (e.g. via HTML lang or XML xml:lang) and for which it has an appropriate hyphenation resource.”
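A sketch of the CSS involved (the vendor prefixes reflect the partial support described above; the class name is made up):

.hyphenate {
  /* only applies where the content is language-tagged in the markup, eg. lang="de" */
  -webkit-hyphens: auto;
  -moz-hyphens: auto;
  hyphens: auto;
  }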

This post is a place for me to dump a few URIs related to this topic, so that I can find them again later.

Hyphenation arrives in Firefox and Safari
http://blog.fontdeck.com/post/9037028497/hyphens

hyphens
https://developer.mozilla.org/en/CSS/hyphens#Gecko_notes
(lists languages to be supported by FF8)

Hyphenation on the web
http://www.gyford.com/phil/writing/2011/06/10/web-hyphenation.php

css text
http://www.gyford.com/phil/writing/2011/06/10/web-hyphenation.php

css generated content
http://dev.w3.org/csswg/css3-gcpm/#hyphenation

The HTML5 specification contains a bunch of new features to support bidirectional text in web pages. Languages written with right-to-left scripts, such as Arabic, Hebrew, Persian, Thaana, Urdu, etc., commonly mix in words or phrases in English or some other language that uses a left-to-right script. The result is called bidirectional or bidi text.

HTML 4.01 coupled with the Unicode Bidirectional algorithm already does a pretty good job of managing bidirectional text, but there are still some problems when dealing with embedded text from user input or from stored data.

The problem

Here’s an example where the names of restaurants are added to a page from a database. This is the code, with the Hebrew shown using ASCII:

<p>Aroma - 3 reviews</p>
<p>PURPLE PIZZA - 5 reviews</p>

And here’s what you’d expect to see, and what you’d actually see.

What it should look like.

AZZIP ELPRUP - 5 reviews

What it actually looks like.

5 - AZZIP ELPRUP reviews


The problem arises because the browser thinks that the “- 5” is part of the Hebrew text. This is what the Unicode Bidi Algorithm tells it to do, and usually it is correct. Not here though.

So the question is how to fix it?

<bdi> to the rescue

The trick is to use the bdi element around the text to isolate it from its surrounding content. (bdi stands for ‘bidi-isolate’.)

<p><bdi>Aroma</bdi> - 3 reviews</p>
<p><bdi>PURPLE PIZZA</bdi> - 5 reviews</p>

The bidi algorithm now treats the Hebrew and “- 5” as separate chunks of content, and orders those chunks per the direction of the overall context, ie. from left-to-right here.

You’ll notice that the example above has bdi around the name Aroma too. Of course, you don’t actually need that, but it won’t do any harm. On the other hand, it means you can write a script in something like PHP that says:

foreach ($restaurants as $restaurant)
  echo "<bdi>{$restaurant['name']}</bdi> - {$restaurant['reviews']} reviews";

This means you can handle any name that comes out of the database, whatever script it is in.

bdi isn’t supported fully by all browsers yet, but it’s coming.

Things to avoid

Using the dir attribute on a span element

You may think that something like this would work:

<p><span dir=rtl>PURPLE PIZZA</span> - 5 reviews</p>

But actually that won’t make any difference, because it doesn’t isolate the content of the span from what surrounds it.

Using Unicode control characters

You could actually produce the desired result in this case using U+200E LEFT-TO-RIGHT MARK just before the hyphen.

<p>PURPLE PIZZA &lrm;- 5 reviews</p>

For a number of reasons, however, it is better to use markup. Markup is part of the structure of the document, it avoids the need to add logic to the application to choose between LRM and RLM, and it doesn’t cause search failures like the Unicode characters sometimes do. Also, the markup can neatly deal with any unbalanced embedding controls inadvertently left in the embedded text.

Using CSS

CSS has also been updated to allow you to isolate text, but you should always use dedicated markup for bidi rather than CSS. This means that the information about the directionality of the document is preserved even in situations where the CSS is not available.

Using bdo

Although it sounds similar, and it’s used for bidi text too, the bdo element is very different. It overrides the bidi algorithm altogether for the text it contains, and doesn’t isolate its contents from the surrounding text.

Using the dir attribute with bdi

The dir attribute can be used on the bdi element to set the base direction. With simple strings of text like PURPLE PIZZA you don’t really need it, however if your bdi element contains text that is itself bidirectional you’ll want to indicate the base direction.

Until now, you could only set the dir attribute to ltr or rtl. The problem is that in a situation such as the one described above, where you are pulling strings from a database or user, you may not know which of these you need to use.

That’s why HTML5 provides a new auto value for the dir attribute, and bdi comes with that set by default. The auto value tells the browser to look at the first strongly typed character in the element and work out from that what the base direction of the element should be. If it’s a Hebrew (or Arabic, etc.) character, the element will get a direction of rtl. If it’s, say, a Latin character, the direction will be ltr.

There are some rare corner cases where this may not give the desired outcome, but in the vast majority of cases it should produce the expected result.

Want another use case?

Here’s another situation where bdi can be useful. This time we are constructing multilingual breadcrumbs on the W3C i18n site. The page titles are generated by a script, and this page is in Hebrew, so the base direction is right-to-left.

Again here’s what you’d expect to see, and what you’d actually see.

What it should look like.

Articles < Resources < WERBEH

What it actually looks like.

Resources < Articles < WERBEH


Whereas in the previous example we were dealing with a number that was confused about its directionality, here we are dealing with a list of same script items in a base direction of the opposite direction.

If you wanted to generate markup that would produce the right ordering, whatever combination of titles was thrown at it, you could wrap each title in bdi elements.
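Following the same ASCII-for-Hebrew convention as in the earlier example, and assuming the paragraph inherits the page’s right-to-left base direction, the generated markup might look something like this:

<p><bdi>HEBREW</bdi> &lt; <bdi>Resources</bdi> &lt; <bdi>Articles</bdi></p>

Because each title is isolated, the titles are ordered according to the right-to-left base direction, giving the expected display.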

Want more information?

The inclusion of these features has been championed by Aharon Lanin of Google within the W3C Internationalization (i18n) Working Group. He is the editor of a W3C Working Draft, Additional Requirements for Bidi in HTML, that tracks a range of proposals made to the HTML5 Working Group, giving rationales and recording resolutions. (The bdi element started out as a suggestion to include a ubi attribute.)

If you’d like more information on handling bidi in HTML in general, try Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.

And here’s the description of bdi in the HTML5 spec.

Bopomofo, or zhùyīn fúhào, is an alphabet that is used for phonetic transliteration of Chinese text. It is usually only used in dictionaries or educational texts, to clarify the pronunciation of the Chinese ideographic characters.

This post is intended to evolve over time. I’ll post other blog posts or tweets as it changes. The current content is to the best of my knowledge correct. Please contribute comments (preferably with pointers to live examples) to help build an accurate picture if you spot something that needs correcting or expanding.

The name bopomofo is equivalent to saying “ABCD” in English, ie. it strings together the pronunciation of the first four characters in the zhuyin fuhao alphabet.

For more information about bopomofo, see Wikipedia and the Unicode Standard.

In this post we will summarise how bopomofo is displayed, to assist people involved in developing the CSS3 Ruby specification. These notes will focus on typical usage for Mandarin Chinese, rather than the extended usage for Minnan and Hakka languages.

Characters and tone marks

These are the bopomofo characters in the basic Unicode Bopomofo block.

One of these characters, U+3127 BOPOMOFO LETTER I, can appear as either a horizontal or vertical line, depending on the context.

In addition to the base characters, there is a set of Unicode characters that are used to express tones. For Mandarin Chinese, these characters are:

02C9 MODIFIER LETTER MACRON
02CA MODIFIER LETTER ACUTE ACCENT
02C7 CARON
02CB MODIFIER LETTER GRAVE ACCENT
02D9 DOT ABOVE

See the list in UniView.

It is important to understand that bopomofo tone marks are not combining characters. They are regular spacing characters that are stored after the sequence of bopomofo letters that make up a syllable. These tone marks can be displayed alongside bopomofo base characters in one of two ways.

Bopomofo used as ruby

When used to describe the phonetics of Chinese ideographs in running text (ie. ruby), bopomofo can be rendered in different ways. A bopomofo transliteration is always done on a character by character basis (ie. mono-ruby).
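For reference, here is a rough sketch of how a couple of annotated ideographs might be marked up in HTML ruby (the word 漢字 is just an example); note that each tone mark is stored after the bopomofo letters of its syllable:

  <!-- mono-ruby: each ideograph carries its own bopomofo annotation -->
  <ruby>漢<rt>ㄏㄢˋ</rt></ruby><ruby>字<rt>ㄗˋ</rt></ruby>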

Horizontal base, horizontal ruby

In this approach the bopomofo is generally written above horizontal base text.

There appear to be two ways of displaying tone marks: (1) following the bopomofo characters for each ideograph, and (2) above the bopomofo characters, as if they were combining characters. We need clarity on which of these approaches is most common, and which needs to be supported. For details about tone placement in (2) see the next section.

Tones following: see image bopo-horiz-toneafter.
Tones above: see image bopo-horiz-tone-above.

Horizontal base, vertical ruby

This is a common configuration. The bopomofo appears in a vertical line to the right of each base character. In general, tone marks then appear to the right of the bopomofo characters, however there are some complications with regard to the actual positioning of these marks (see the next section for details).

Example.

Vertical base, vertical ruby

This works just like horizontal base+vertical ruby.

Example.

Vertical base, horizontal ruby

I don’t believe that this exists.

Tones in bopomofo ruby

In ruby text, tones 2-4 are displayed in their own vertical column to the right of the bopomofo letters, and tone 1 is displayed above the column of bopomofo letters.

The first tone

The first tone is not displayed. Here is an example of a syllable with the first tone. There are two bopomofo letters, but no tone mark.

Picture showing a syllable made up of two bopomofo letters, with no tone mark.

Tones 2 to 4

The position of tones 2-4 depends on the number of bopomofo characters the tone modifies.

The Ministry of Education in Taiwan has issued charts indicating the expected positioning for vertically aligned bopomofo that conform roughly to this diagram:

Picture showing relative positions of tone marks and vertical bopomofo.

Essentially, about half of the tone glyph box extends upwards from the top of the last bopomofo character box.

Tones in horizontal ruby are placed differently, relative to the bopomofo characters, according to the Ministry charts. Essentially, about half the width of the tone glyph extends to the right of the last bopomofo character in the sequence.

The charts cover alignment for vertical text (here, here and here) and for horizontal text (here, here and here).

In some cases the tone appears to be simply displayed alongside the last character in vertical text, as shown in these examples:

Three characters with the MODIFIER LETTER ACUTE ACCENT tone, but different numbers of bopomofo letters.

The light tone

When a light tone is used (U+02D9 DOT ABOVE), it appears at the top of the column of bopomofo letters, even though it is stored after those letters in memory. The image just below illustrates this.

Picture showing the dot above other bopomofo characters.

Note that the actual sequence of characters in memory is:

3109: BOPOMOFO LETTER D
3127: BOPOMOFO LETTER I
02D9: DOT ABOVE

The apparent placing of the dot above the first bopomofo letter is an artifact of rendering only.

Bopomofo written on its own

It is not common to see text written only in bopomofo, but it does occur from time to time for Chinese, and sometimes it is used for aboriginal Taiwanese languages.

In horizontal text

When written on its own in horizontal layout any tone marks are displayed as spacing characters after the syllable they modify.

Example.

In vertical text

I haven’t seen bopomofo used in its own right in vertical text, so I don’t know whether in that case one puts the tone marks below the bopomofo letters for a syllable, or to the side like when bopomofo is used as ruby.

Mixed in with Chinese text

I have also come across instances where a bopomofo character has been included among Chinese ideographs. It may be that this reflects slang or colloquial usage.

Example 1. Example 2.

In this post I’m hoping to make clearer some of the concepts and issues surrounding jukugo ruby. If you don’t know what ruby is, see the article Ruby for a very quick introduction, or see Ruby Markup and Styling for a slightly longer introduction to how it was expected to work in XHTML and CSS.

You can find an explanation of jukugo ruby in Requirements for Japanese Text Layout, sections 3.3 Ruby and Emphasis Dots and Appendix F Positioning of Jukugo-ruby (you need to read both).

What is jukugo ruby?

Jukugo refers to a Japanese compound noun, ie. a word made up of more than one kanji character. We are going to be talking here about how to mark up these jukugo words with ruby.

There are three types of ruby behaviour.

Mono ruby is commonly used for phonetic annotation of text. In mono-ruby all the ruby text for a given character is positioned alongside a single base character, and doesn’t overlap adjacent base characters. Jukugo are often marked up using a mono-ruby approach. You can break a word that uses mono ruby at any point, and the ruby text just stays with the base character.

Group ruby is often used where phonetic annotations don’t map to discrete base characters, or for semantic glosses that span the whole base text. You can’t split text that is annotated with group ruby; it has to wrap as a single unit onto the next line.

Jukugo ruby is a term that is used not to describe ruby annotations over jukugo text, but rather to describe ruby with a slightly different behaviour than mono or group ruby. Jukugo ruby behaves like mono ruby, in that there is a strong association between ruby text and individual base characters. This becomes clear when you split a word at the end of a line: you’ll see that the ruby text is split so that the ruby annotating a specific base character stays with that character. What’s different about jukugo ruby is that when the word is NOT split at the end of the line, there can be some significant amount of overlap of ruby text with adjacent base characters.

Example of ruby text.

The image to the right shows three examples of ruby annotating jukugo words.

In the top two examples, mono ruby can be used to produce the desired effect, since neither of the base characters is overlapped by ruby text that doesn’t relate to that character.

The third example is where we see the difference that is referred to as jukugo ruby. The first three ruby characters are associated with the first kanji character. Just the last ruby character is associated with the second kanji character. And yet the ruby text has been arranged evenly across both kanji characters.

Note, however, that we aren’t simply spreading the ruby over the whole word, as we would with group ruby. There are rules that apply, and in some cases gaps will appear. See the following examples of distribution of ruby text over jukugo words.

Various examples of jukugo ruby.
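For anyone wondering what the markup side might look like, here is a rough sketch using the word 熟語 (juku-go) itself, where each kanji gets its own reading. Whether this displays as mono ruby or as jukugo ruby depends on the browser and the styling applied, which is exactly the problem area discussed next.

  <!-- each rt gives the reading for the kanji that precedes it -->
  <ruby>熟<rt>じゅく</rt>語<rt>ご</rt></ruby>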

In the next part of this post I will look at some of the problems encountered when trying to use HTML and CSS for jukugo ruby.

If you want to discuss this or contribute thoughts, please do so on the public-i18n-cjk@w3.org list. You can see the archive and subscribe at http://lists.w3.org/Archives/Public/public-i18n-cjk/

Webfonts, and WOFF in particular, have been in the news again recently, so I thought I should mention that a few days ago I changed my pages describing Myanmar and Arabic-for-Urdu scripts so that you can download the necessary font support for the foreign text, either as a TTF linked font or as WOFF font.

You can find the Myanmar page at http://rishida.net/scripts/myanmar/. Look for the links in the side bar to the right, under the heading “Fonts”.

The Urdu page, using the beautiful Nastaliq script, is at http://rishida.net/scripts/urdu/.

(Note that the examples of short vowels don’t use the nastaliq style. Scroll down the page a little further.)

I haven’t had time to check whether all the opentype features are correctly rendered, but I’ve been doing Mac testing of the i18n webfonts tests, and it looks promising. (More on that later.) The Urdu font doesn’t rely on OS rendering, which should help.
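For anyone curious about the mechanics, the linking is done with an ordinary @font-face rule, something along these lines (the family name and file names below are placeholders, not the actual files used on those pages):

  <style>
    @font-face {
      font-family: "MyanmarWebfont";   /* placeholder name */
      src: url("fonts/myanmar-sample.woff") format("woff"),
           url("fonts/myanmar-sample.ttf") format("truetype");
    }
    .my { font-family: "MyanmarWebfont", sans-serif; }
  </style>
  <p class="my">မြန်မာ</p>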

Here are some examples of the text on the page:
Examples of Urdu script and Myanmar script.

http://rishida.net/scripts/indic-overview/

I finally got around to refreshing this article, by converting the Bengali, Malayalam and Oriya examples to Unicode text. Back when I first wrote the article, it was hard to find fonts for those scripts.

I also added a new feature: In the HTML version, click on any of the examples in indic text and a pop-up appears at the bottom right of the page, showing which characters the example is composed of. The pop-up lists the characters in order, with Unicode names, and shows the characters themselves as graphics.

I have not yet updated this article’s incarnation as Unicode Technical Note #10. The Indian Government also used this article, and made a number of small changes. I have yet to incorporate those, too.

I recently came across an email thread where people were trying to understand why they couldn’t see Indian content on their mobile phones. Here are some notes that may help to clarify the situation. They are not fully developed! Just rough jottings, but they may be of use.

Let’s assume, for the sake of an example, that the goal is to display a page in Hindi, which is written using the devanagari script. These principles, however, apply to one degree or another to all languages that use characters outside the ASCII range.

Let’s start by reviewing some fundamental concepts: character encodings and fonts. If you are familiar with these concepts, skip to the next heading.

Character encodings and fonts

Content is composed of a sequence of characters. Characters represent letters of the alphabet, punctuation, etc. But content is stored in a computer as a sequence of bytes, which are numeric values. Sometimes more than one byte is used to represent a single character. Like codes used in espionage, the way that the sequence of bytes is converted to characters depends on what key was used to encode the text. In this context, that key is called a character encoding.

There are many character encodings to choose from.

The person who created the content of the page you want to read should have used a character encoding that supports devanagari characters, but it should also be a character encoding that is widely recognised by browsers and available in editors. By far the best character encoding to use (for any language in the world) is called UTF-8.

UTF-8 is strongly recommended by the HTML5 draft specification.

There should be a character encoding declaration associated with the HTML code of your page to say what encoding was used. Otherwise the browser may not interpret the bytes correctly. It is also crucial that the text is actually stored in that encoding too. That means that the person creating the content must choose that encoding when they save the page from their editor. It’s not possible to change the encoding of text simply by changing the character encoding declaration in the HTML code, because the declaration is there just to indicate to the browser what key to use to get at the already encoded text.
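For example, using the HTML5-style declaration (the older http-equiv form does the same job), the markup near the top of the page might look like this:

  <head>
    <!-- tells the browser which key (encoding) to use to interpret the bytes -->
    <meta charset="utf-8">
    <!-- older, equivalent form:
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> -->
    <title>हिन्दी</title>
  </head>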

It’s one thing for the browser to know how to interpret the bytes to represent your text, but the browser must also have a way to make those characters stored in memory appear on the screen.

A font is essential here. Fonts contain instructions for displaying a character or a sequence of characters so that you can read them. The visual representation of a character is called a glyph. The font converts characters to glyphs.

The font has tables to map the bytes in memory to text. To do this, the font needs to recognise the character encoding your page uses, and have the necessary tables to convert the characters to glyphs. It is important that the font used can work with the character encoding used in the page you want to view. Most fonts these days support UTF-8 encoded text.

Very simple fonts contain one glyph for each letter of the alphabet. This may work for English, but it wouldn’t work for a complex script such as devanagari. In these scripts the positioning and interaction of characters have to be modified according to the context in which they are displayed. This means that the font needs additional information about how to choose and position glyphs depending on the context. That information may be built into the font itself, or the font may rely on information on your system.

Character encoding support

The browser needs to be able to recognise the character encoding used in order to correctly interpret the mapping between bytes and characters.

If the character encoding of the page is incorrectly declared, or not declared at all, there will be problems viewing the content. Typically, a browser allows the user to manually apply a particular encoding by selecting the encoding from the menu bar.

All browsers should support the UTF-8 character encoding.

Sometimes people use an encoding that is not designed for devanagari support with a font that produces the right glyphs nevertheless. Such approaches are fraught with issues and result in poor interoperability on several levels. For example, the content can only be interpreted correctly by applying the specifically designed font; no other font will do if that font is not available. Also, the meaning of the text cannot be derived by machine processing, for web searches, etc., and the data cannot be easily copied or merged with other text (eg. to quote a sentence in another article that doesn’t use the same encoding). This practice seriously damages the openness of the Web and should be avoided at all costs.

System font support

Usually, a web page will rely on the operating system to provide a devanagari font. If there isn’t one, users won’t be able to see the Hindi text. The browser doesn’t supply the font, it picks it up from whatever platform the browser is running on.

If the browser is running on a desktop computer, there may be a font already installed. If not, it should be possible to download free or commercial fonts and install them. If the user is viewing the page on a mobile device, it may currently be difficult to download and install one.

If there are several devanagari fonts on a system, the browser will usually pick one by default. However, if the web page uses CSS to apply styling to the page, the CSS code may specify one or more particular fonts to use for a given piece of content. If none of these is available on the system, most browsers will fall back to the default; Internet Explorer, however, will show square boxes instead.
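The CSS involved is just a font-family list, something like the following sketch (the font names are examples only, and availability differs from system to system):

  <style>
    /* the browser tries each named font in turn; if none is available,
       most browsers fall back to a default devanagari font */
    body { font-family: Mangal, "Lohit Devanagari", sans-serif; }
  </style>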

Webfonts

Another way of getting a font onto the user’s system is to download it with the page, just as images are downloaded with the page. This is done using CSS code. The CSS code to do this has been defined for some years, but unfortunately most browsers’ implementations of this feature are still problematic.

Recently a number of major browsers have begun to support download of raw truetype or opentype fonts; Internet Explorer is not one of them. This involves simply loading the ordinary font onto a server and downloading it to the browser when the page is displayed. Although the font may be cached as the user moves from page to page, there may still be some significant issues when dealing with complex scripts or Far Eastern languages (such as Chinese, Japanese and Korean) due to the size of the fonts used. The size of these fonts can often be counted in megabytes rather than kilobytes.

It is important to observe licensing restrictions when making fonts available for download in this way. The CSS mechanism doesn’t contain any restrictions related to font licences, but there are ways of preparing fonts for download that take into consideration some aspects of this issue – though not enough to provide a watertight restriction on font usage.

Microsoft makes available a program to create .eot fonts from ordinary true/opentype fonts. Eot font files can apply some usage restrictions and also subset the font to include only the characters used on the page. The subsetting feature is useful when only a small amount of text appears in a given font, but for a whole page in, say, devanagari script it is of little use – particularly if the user is to input text in forms. The biggest problem with .eot files, however, is that they are only supported by Internet Explorer, and there are no plans to support .eot format on other browsers.

The W3C is currently working on the WOFF format. Fonts converted to WOFF format can have some gentle protection with regard to use, and also apply significant compression to the font being downloaded. WOFF is currently only supported by Firefox, but all other major browsers are expected to provide support for the new format.

For this to work well, all browsers must support the same type of font download.

Beyond fonts

Complex scripts, such as those used for Indic and South East Asian languages, need to choose glyph shapes and positions and substitute ligatures, etc. according to the context in which characters are used. These adjustments can be accomplished using the features of OpenType fonts. The browser must be able to implement those opentype features.

Often a font will also rely on operating system support for some subset of the complex script rendering. For example, a devanagari font may rely on the Windows uniscribe dll for things like positioning of left-appended vowel signs, rather than encoding that behaviour into the font itself. This reduces the size and complexity of the font, but exposes a problem when using that font on a variety of platforms. Unless the operating system can provide the same rendering support, the text will look only partially correct. Mobile devices must either provide something similar to uniscribe, or fonts used on the mobile device must include all needed rendering features.

Browsers that do font linking must also support the necessary opentype features and obtain functionality from the OS rendering support where needed.

If tools are developed to subset webfonts, the subsetting must not remove the rendering logic needed for correct display of the text.

Characters in the Unicode Bengali block.

If you’re interested, I just did a major overhaul of my script notes on Bengali in Unicode. There’s a new section about which characters to use when there are multiple options (eg. RRA vs. DDA+nukta), and the page provides information about more characters from the Bengali block in Unicode (including those used in Bengali’s amazingly complicated currency notation prior to 1957).

In addition, this has all been squeezed into the latest look and feel for script notes pages.

The new page is at a new location. There is a redirect on the old page.

Hope it’s useful.

>> Read it


>> Read it!

Picture of the page in action.

I finally got to the point, after many long early morning hours, where I felt I could remove the ‘Draft’ from the heading of my Myanmar (Burmese) script notes.

This page is the result of my explorations into how the Myanmar script is used for the Burmese language in the context of the Unicode Myanmar block. It takes into account the significant changes introduced in Unicode version 5.1 in April of this year.

Btw, if you have JavaScript running you can get a list of characters in the examples by mousing over them. If you don’t have JS, you can link to the same information.

There’s also a PDF version, if you don’t want to install the (free) fonts pointed to for the examples.

Here is a summary of the script:

Myanmar is a tonal, syllable-based language. The script is an abugida, ie. consonants carry an inherent vowel sound that is overridden using vowel signs.

Spaces are used to separate phrases, rather than words. Words can be separated with ZWSP to allow for easy wrapping of text.

Words are composed of syllables. These start with a consonant or initial vowel. An initial consonant may be followed by a medial consonant, which adds the sound j or w. After the vowel, a syllable may end with a nasalisation of the vowel or an unreleased glottal stop, though these final sounds can be represented by various different consonant symbols.

At the end of a syllable a final consonant usually has an ‘asat’ sign above it, to show that there is no inherent vowel.

In multisyllabic words derived from an Indian language such as Pali, where two consonants occur internally with no intervening vowel, the consonants tend to be stacked vertically, and the asat sign is not used.

Text runs from left to right.

There is a set of Myanmar numerals, which are used just like Latin digits.

So, what next? I’m quite keen to get to Mongolian. That looks really complicated. But I’ve been telling myself for a while that I ought to look at Malayalam or Tamil, so I think I’ll try Malayalam.

I’m sitting here watching a video of Timbl talking on a BBC news page and I suddenly realised how good this was.

The page design helps give the impression – there are no clunky boxes around the video itself – but there’s also no need to view in a different area, or switch to another tool, or even wait for a download to get started – it’s just there as part of the page, but a part that moves and produces sound. Kind of like the moving paper in Harry Potter’s world.

It’s great how technology marches on sometimes.

[Update: Since I wrote the above the video has acquired grey panels around the edges for controls, which I think is a shame. It's still pretty good technology though.]

This post is about the dangers of tying a specification, protocol or application to a specific version of Unicode.

For example, I was in a discussion last week about XML, and the problems caused by the fact that XML 1.0 is currently tied to a specific version of Unicode, and a very old version at that (2.0). This affects what characters you can use for things such as element and attribute names, enumerated lists for attribute values, and ids. Note that I’m not talking about the content, just those names.

I spoke about this at a W3C Technical Plenary some time back in terms of how this bars people from using certain aspects of XML applications in their own language if they use scripts that have been added to Unicode since version 2.0. This includes over 150 million people speaking languages written with Ethiopic, Canadian Syllabics, Khmer, Sinhala, Mongolian, Yi, Philippine, New Tai Lue, Buginese, Cherokee, Syloti Nagri, N’Ko, Tifinagh and other scripts.

This means, for example, that if your language is written with one of these scripts, and you write some XHTML that you want to be valid (so you can use it with AJAX or XSLT, etc.), you can’t use the same language for an id attribute value as for the content of your page. (Try validating this page now. The previous link used some Ethiopic for the name and id attribute values.)
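To make that concrete, here is the kind of markup in question (the Ethiopic text is just an illustrative label): it is perfectly reasonable content for an Amharic page, but the id value is invalid because Ethiopic letters were only added to Unicode after version 2.0.

  <!-- the id value uses Ethiopic letters, which XML 1.0 names don't allow -->
  <h2 id="መግቢያ">መግቢያ</h2>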

But there’s another issue that hasn’t received so much press – and yet I think, in its own way, it can be just as problematic. Scripts that were supported by Unicode 2.0 have not stood still, and additional characters are being added to such scripts with every new Unicode release. In some cases these characters will see very general use. Take for example the Bengali character U+09CE BENGALI LETTER KHANDA TA.

With the release of Unicode 4.1 this character was added to the standard, with a clear admonition that it should in future be used in text, rather than the workaround people had been using previously.

This is not a rarely used character. It is a common part of the alphabet. Put Bengali in a link and you’re generally ok. Include a khanda ta letter in it, though, and you’re in trouble. It’s as if English speakers could use any word in an id, as long as it didn’t have a ‘q’ in it. It’s a recipe for confusion and frustration.

Similar, but much more far reaching, changes will be introduced to the Myanmar script (used for Burmese) in the upcoming version 5.1. Unlike the khanda ta, these changes will affect almost every word. So if your application or protocol froze its Unicode support to a version between 3.0 and 5.0, like IDNA, you will suddenly be disenfranchising Burmese users who had been perfectly happy until now.

Here are a few more examples (provided by Ken Whistler) of characters added to Unicode after the initial script adoption that will raise eyebrows for people who speak the relevant language:

  • 01F9 LATIN SMALL LETTER N WITH GRAVE: shows up in NFC pinyin data for Chinese.
  • 0219 LATIN SMALL LETTER S WITH COMMA BELOW: Romanian data.
  • 0450 CYRILLIC SMALL LETTER IE WITH GRAVE: Macedonian in NFC.
  • 0653..0655 Arabic combining maddah and hamza: Implicated in NFC normalization of common Arabic letters now.
  • 0972 DEVANAGARI LETTER CANDRA A: Marathi.
  • 097B DEVANAGARI LETTER GGA: Sindhi.
  • 0B35 ORIYA LETTER VA: Oriya.
  • 0BB6 TAMIL LETTER SHA: Needed to spell sri.
  • 0D7A..0D7F Malayalam chillu letters: Those will be ubiquitous in Malayalam data, post Unicode 5.1.
  • and a bunch of Chinese additions.

So the moral is this: decouple your application, protocol or specification from a specific version of the Unicode Standard. Allow new characters to be used by people as they come along, and users all around the world will thank you.
