Search Results for 'uniview'
Posted on Wed 19 Aug 2009 under general, i18n, utilities, web
>> Use it

I have added a number of new features to my lookup tool to help with choosing language tags. More information is now available when you look up subtags (such as what to use if the subtag is deprecated, and which subtags a macrolanguage encloses), and there are more tests of well-formedness, with clearer explanations of the problem.
This should make it a lot more useful to people who haven’t read BCP 47 and want to create language tags. Hopefully, in a short while, I’ll also write and link to an article that describes how to use subtags from the ground up in a procedural way, that will complement the tool.
For further assistance, you can now link from a language subtag result to the SIL Ethnologue, to make it easier to check whether that subtag really does refer to the language you were thinking of.
In addition, script subtag results link to Unicode blocks in UniView.
Posted on Fri 31 Jul 2009 under general, i18n, utilities, web
>> See what it can do
>> Use it

Following hot on the heels of the last release come some further significant changes to UniView aimed at making it easier to use as Unicode grows.
The big change is that UniView now starts up in graphics mode by default. This means that pages load more slowly, but (especially with the continuing growth of Unicode) also means that you are more likely to be able to see the characters you are looking for. It’s easy to switch between modes at any point, using the “Use graphics” checkbox. (And if you preferred font glyphs as a default, you just need to change the URI in your bookmarked link slightly, and you can continue to work that way.)
To facilitate this change, I created my own graphics for a number of blocks which are not yet covered by decodeunicode, or which are no longer fully covered by decodeunicode. The blocks for which I provided graphics are Latin Extended-C, Latin Extended-D, Latin Extended Additional, Cyrillic Supplement, Cyrillic Extended-B, Modifier Tone Letters, Tibetan, Malayalam, Saurashtra, Ol Chiki, Myanmar, Kayah Li, Cham, Rejang, Vai, Supplemental Punctuation, and Miscellaneous Symbols and Arrows.
There are still many characters for which there are no graphics (especially the new characters in Unicode 5.2), but coverage is much better than it was. As I find more fonts, I will be able to create graphics for the remaining characters.
I also put a grey box around the characters in tables. This is particularly useful if there are no graphics or font glyphs for a block or range of characters, as it makes it easier to locate the character you are looking for.
I also fixed a bug that was preventing Chrome and Safari and IE from displaying the first two Latin blocks. I think the bug was actually in the Unicode data file.
Posted on Mon 27 Jul 2009 under general, i18n, utilities, web
>> See what it can do
>> Use it

With the family now in Japan, I had some extra time to spare this weekend, so I upgraded UniView to handle all the proposed characters for Unicode 5.2.
While the properties for new and modified characters are still in beta, they are not officially stable; the character allocations, however, should be stable at this point. UniView therefore alerts you if you are looking at a new character.
If the Unicode database information has changed for a given character you are also warned, and provided with a link that points to the previous information for that character. These warnings will be removed from UniView when Unicode 5.2 is released.
Of course, you are unlikely to be able to actually see the new characters themselves, unless you are lucky enough to have a very new font to hand. The graphic alternatives are not available yet for these characters. I’m wondering whether it’s possible for me to do something about that, but that will take a little longer. In the meantime, you might find it more useful to view blocks in list view. (Click on ‘Show range as list’).
This release also fixes a few small bugs in the HTML and JavaScript code.
Posted on Sat 14 Feb 2009 under general, i18n, utilities, web
>> See what it can do !
>> Use it !

The major changes in this version include a new feature to normalise text as NFC or NFD, the ability to accept decimal code point values, and an overhaul of the top part of the user interface.
Added buttons to the Text area to allow conversion of the text to NFC or NFD normalization forms. (You may not notice the change until you list the characters.)
The control panel was also substantially rearranged again to hopefully make it easier for newcomers to see what they can do.
The Code point conversion feature was upgraded to handle decimal code point values.
A single character in the codepoints area or text area is now listed in the lower left panel when you click on its icon, rather than in the right-hand properties panel. This is to improve consistency and avoid surprises.
Added a link to the CLDR property demo from the right panel to give access to additional properties.
Improved the parsing of codepoints when surrounded by text in the Code point input field, so that it now works with &#x…; and \u… and \U… escapes.
Jettisoned some unneeded code to reduce download by around 40-50K bytes. Implemented the NFC/NFD feature using AJAX, to avoid putting the download size back up.
When you delete the contents of the text area or the code point area, the associated input field is given focus, so you are ready for input.
A couple more minor bug fixes.
Posted on Wed 4 Feb 2009 under code notes, general, i18n, utilities, web
I was asked to make available the code for my normalization functions in JavaScript and PHP. The links are below. I’m making the code available under a Creative Commons Attribution-Noncommercial-Share Alike licence.
Disclaimers Note that I make no claim to have produced polished, compact or well-optimised code! The code does what I need, and I’m happy with that. You are welcome to suggest improvements, and I’m sure there are many that could be made.
As they say, this code is made available in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.
The code is a little more convoluted than it ought to be, to get around the fact that JavaScript doesn’t understand supplementary characters, and PHP just doesn’t naturally understand Unicode. (How I long for PHP6.)
Update: [[I meant to mention that there is a way of doing normalization in PHP already. I made this code available just because I had it. I created it as a learning exercise. It may be useful, however, if you are unable to load the ICU and intl packages onto your server.]]
To use the code, simply call nfc('your-text-string') or nfd('your-text-string') from your code and capture the result.
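To see what the two forms do without downloading anything, you can try the String.prototype.normalize method that is built into current JavaScript engines. This is a minimal sketch using that built-in, not the nfc() and nfd() routines described in this post; the sample string and the helper name hexCodePoints are just illustrations.

```javascript
// Uses the engine's built-in normalizer, NOT the nfc()/nfd() routines from this post.
const composed = '\u00E9';                       // é as one precomposed code point
const decomposed = composed.normalize('NFD');    // e followed by U+0301 COMBINING ACUTE ACCENT
const recomposed = decomposed.normalize('NFC');  // back to the single code point U+00E9

// Hypothetical helper: list the code points of a string as hex, to make the change visible.
function hexCodePoints(text) {
  return Array.from(text, ch => ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

console.log(hexCodePoints(composed));    // code points before decomposition
console.log(hexCodePoints(decomposed));  // code points after NFD
```

As noted for the buttons in UniView, the text usually looks identical before and after; the difference only shows when you list the code points.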
For PHP you’ll need these routines and this data.
For JavaScript look at these routines and this data. There is also a lite version of the data file that doesn’t include Han characters. I use this sometimes for bandwidth savings (about 14K less).
Test files I also created some test files for PHP and for JavaScript.
Both of these expect to find a copy of http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt in the local directory. These files run 71,076 tests.
Cautions Be careful about the editor you use for the data files. I spent several hours fruitlessly debugging the routines, only to find that Notepad++ was displaying certain supplementary characters ok, but corrupting them on save. I switched to Notepad and the problem evaporated. And I probably don’t need to add that editing the data files in something like DreamWeaver is a bad idea because it will probably normalize the data before saving.
Another point: you may see Unicode replacement characters at a couple of points in the PHP source. These represent the first and last characters in the high surrogate range.
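The surrogate handling mentioned above comes down to UTF-16 arithmetic: JavaScript strings are sequences of 16-bit code units, so a code point above U+FFFF is stored as a high surrogate followed by a low surrogate. The following is a sketch of that arithmetic only. The function names are hypothetical and are not taken from the published routines.

```javascript
// Hypothetical helpers illustrating UTF-16 surrogate arithmetic.

// Split a supplementary code point (U+10000..U+10FFFF) into [high, low] surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('not a supplementary code point');
  }
  const offset = codePoint - 0x10000;    // 20 bits of payload
  const high = 0xD800 + (offset >> 10);  // top 10 bits, range D800..DBFF
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits, range DC00..DFFF
  return [high, low];
}

// Combine a high and a low surrogate back into a single code point.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}
```

For example, toSurrogatePair(0x1D037) gives the pair D834 DC37, and fromSurrogatePair(0xD834, 0xDC37) returns 0x1D037 again.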
Experimenting If you want to play with something that uses this you could try my Tłįchǫ (Dogrib) character picker, or my Normalizer tool. I will slowly fit this to all the pickers and to UniView. I have a local version of UniView waiting in the wings that uses the PHP files via AJAX, to reduce download size. For that you need a file that returns the result as plain text across the wire, such as this.
Well, I hope that that may be of use to someone, somewhere. I hope I haven’t forgotten anything.
Posted on Wed 14 Jan 2009 under general, i18n, utilities, web
>> See what it can do !
>> Use it !

The major changes in this version relate to the way searching and property-based lookup is done on characters in the lower left panel, and features for refining and capturing the resulting lists.
Removed the two Highlight selection boxes. These used to highlight characters in the lower left panel with a specific property value. The Show selection box on the left (used to be Show list) now does that job if you set the Local checkbox alongside it. (Local is the default for this feature.)
As part of that move, the former SiR (search in range) checkbox that used to be alongside Custom range has been moved below the Search for input field, and renamed to Local. If Local is checked, searching can now be done on any content in the lower left panel, and the results are shown as highlighting, rather than a new list.
To complement these new highlighting capabilities, a new feature was added. If you click on the icon next to Make list from highlights the content of the lower left panel will be replaced by a list of just those items that are currently highlighted – whether the highlighting results from a search or a property listing. Note that this can also be useful to refine searches: perform an initial search, convert the result to a list, then perform another search on that list, and so on.
Finally got around to putting icons after the pull-down lists. This means that if you want to reapply, say, a block selection after doing something else, only one click is needed (rather than having to choose another option, then choose the original option). The effect of this on the ease of use of UniView is much greater than I expected.
Added an icon to the text area. If you click on this, all the characters in the lower left panel are copied into the text area. This is very useful for capturing the result of a search, or even a whole block. Note that if a list in the lower left panel contains unassigned code points, these are not copied to the text area.
As a result of the above changes, the way Show as graphics and Show range as list work internally was essentially rewritten, but users shouldn’t see the difference.
Changed the label Character area to Text area.
Posted on Wed 7 Jan 2009 under general, i18n, utilities, web
>> See what it can do !
>> Use it !

The main change in this version is the reworking of the former Cut & paste and Code point(s) fields to make it easier to use UniView as a generalised picker.
Moved the cut&paste field downwards, made it larger, and changed the label to character area. This should make it easier to deal with text copy/cut & paste, and more obvious that that is possible with UniView. It is much clearer now that UniView provides character map/picker functionality, and not just character lookup.
Whereas previously you had to double-click to put a character in the lower left pane into the Cut&paste field, UniView now echoes characters to the Character area every time you (single) click on a character in the lower left hand pane. This can be turned off. Double-clicking will still add the codepoint of a character in the lower left panel to the Code points field.
The Character area has its own set of icons, some of which are new: i.e. you can select the text, add a space, and change the font of the text in the area (as well as turn the echo on and off). I also spruced up the icons on the UI in general.
Note that on most browsers you can insert characters at the point in the Character area where you set the cursor, or you can overwrite a highlight range of characters, whereas (because of the non-standard way it handles selections and ranges) Internet Explorer will always add characters to the end of the line.
The Code points field has also been enlarged, and I moved the Show list pull-down to the left and Show as graphics and Show page as list to the right. This puts all the main commands for creating lists together on the left.
When you mouse over a character in the lower left pane you now see both hex and decimal codepoint information. (Previously you just saw an unlabelled decimal number.) You will also find decimal code point values for characters displayed in the lower right panel.
Fixed a bug in the Code points input feature so that trailing spaces no longer produce errors, but also went much further than that. You can now add random text containing codepoints or most types of hex-based escaped characters to the input field, and UniView will seek them out to create the list. For example, if you paste the following into the Code points field:
the decomposition mapping is <U+CE20, U+11B8>, and not <U+110E, U+1173, U+11B8>.
the result will be:
CE20: 츠 [Hangul Syllables]
11B8: ᆸ HANGUL JONGSEONG PIEUP
110E: ᄎ HANGUL CHOSEONG CHIEUCH
1173: ᅳ HANGUL JUNGSEONG EU
11B8: ᆸ HANGUL JONGSEONG PIEUP
Of course, UniView is not able to tell that an ordinary word like ‘Abba’ is not a hex codepoint, so you obviously need to watch out for that and a few other situations, but much of the time this should make it much easier to extract codepoint information.
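The kind of scan described above can be sketched in a few lines of JavaScript. This is only a guess at the general approach, not UniView’s actual code: the function name extractCodePoints is hypothetical, and the set of escape prefixes it strips is an assumption.

```javascript
// Hypothetical sketch: pull hex code points out of arbitrary text.
function extractCodePoints(text) {
  // Replace common escape prefixes with a space so the hex digits stand alone as a word.
  const cleaned = text.replace(/U\+|\\[uU]|&#x/g, ' ');
  // Any whole "word" made only of hex digits counts as a candidate code point.
  const candidates = cleaned.match(/\b[0-9A-Fa-f]{1,8}\b/g) || [];
  return candidates
    .map(hex => parseInt(hex, 16))
    .filter(cp => cp <= 0x10FFFF); // discard values outside the Unicode range
}

const sample = 'the decomposition mapping is <U+CE20, U+11B8>, and not <U+110E, U+1173, U+11B8>.';
console.log(extractCodePoints(sample).map(cp => cp.toString(16).toUpperCase()));
```

A scan like this has the weakness mentioned above: extractCodePoints('Abba') returns the single code point U+ABBA, because every letter in the word happens to be a hex digit.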
I still haven’t found a way to fix the display bug in Safari and Google Chrome that causes initial content in the lower left pane to be only partially displayed.
Posted on Sat 1 Nov 2008 under general, i18n, utilities, web
>> See what it can do !
>> Use it !

A large amount of code was rewritten to enable data to be downloaded from the server via AJAX at the point of need. This eliminates the long wait when you start to use UniView without the database information in your cache. This means that there is a slightly longer delay when you view a new block, but the code is designed so that if you have already downloaded data, you don’t have to retrieve it again from the server.
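The “download once, at the point of need” behaviour can be sketched as a small cache wrapped around whatever function actually fetches the data. This is an illustration of the idea under my own assumptions, not the UniView code; makeBlockLoader and fetchBlock are hypothetical names.

```javascript
// Hypothetical sketch of load-on-demand with caching.
// fetchBlock is any function that retrieves the data for one block from the server.
function makeBlockLoader(fetchBlock) {
  const cache = new Map(); // block name -> data already retrieved
  return function loadBlock(blockName) {
    if (!cache.has(blockName)) {
      // First request for this block: go to the server and remember the result.
      cache.set(blockName, fetchBlock(blockName));
    }
    return cache.get(blockName); // later requests are served from the cache
  };
}

// Usage with a stand-in fetcher that just records how often it is called.
let serverCalls = 0;
const loadBlock = makeBlockLoader(name => { serverCalls++; return 'data for ' + name; });
loadBlock('Tibetan');
loadBlock('Tibetan');
console.log(serverCalls); // the second call did not go back to the fetcher
```

If fetchBlock returns a promise, the same wrapper caches the promise, so concurrent requests for the same block share one download.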
The search mechanism was also rewritten. The regular expressions used must now be supported in both JavaScript and PHP (PHP is used if not searching within the current range). When ‘other’ is ticked, the search will look in the alternative name fields, but not in other property settings, so you can no longer use something like ;AL; to search for characters with a particular property. (Use ‘Show list’ instead.)
Removed several zero-width space characters from the code, which means that UniView now works with Google Chrome, except for some annoying display bugs that I’m not sure how to fix – for example, the first time you try to display any block you only seem to get the top line (although, if you click or drag the mouse, the block is actually there). This seems to be WebKit related, since it happens in Safari, too.
Please report any bugs to me, and don’t forget to refresh any UniView files in your cache before using the new version.
Posted on Mon 7 Apr 2008 under general, i18n, utilities, web
>> See what it can do !
>> Use it !

Those of you who have used UniView over the last couple of days will have seen that it now supports Unicode 5.1. All Unicode 5.1 character information is available, but you will only be able to see the new characters if you have fonts that cover them. The decodeunicode graphics for the new characters are not yet available.
Last night I also fixed a long-running bug that had meant that additional information available in my character database was not accessible in Internet Explorer (due to AJAX issues). (See the related post if you are interested in the code).
There are no other changes at this time (though those two are pretty significant).
Please report any bugs to me, and don’t forget to refresh any UniView files in your cache before using the new version.
Posted on Sun 6 Apr 2008 under code notes, general, web
Some code I put together to import some XML retrieved via AJAX into a document (stored here so I can find it again in the future).
IE won’t let you import a cloned nodeset into a document, so I wrote this for my UniView utility. The code starts with a node in the AJAX data and creates a copy of all elements and attributes in the current document.
function copyNodes (ajaxnode, copiednode) {
  // Walk the children of the AJAX node and rebuild each one in the current document.
  for (var node = ajaxnode.firstChild; node != null; node = node.nextSibling) {
    if (node.nodeType == 3) { // text node
      copiednode.appendChild(document.createTextNode(node.data));
    }
    if (node.nodeType == 1) { // element node
      var subnode = document.createElement(node.nodeName);
      var attlist = node.attributes;
      if (attlist != null) {
        for (var i = 0; i < attlist.length; i++) {
          subnode.setAttribute(attlist[i].name, attlist[i].value);
        }
      }
      copiednode.appendChild(subnode);
      copyNodes(node, subnode); // recurse into the element's own children
    }
  }
}
It doesn’t expect processing instructions, comments etc. Just elements and attributes. (Though of course that can be added, if needed.)
Posted on Sun 9 Mar 2008 under general, i18n, utilities, web
>> See what it can do !
>> Use it !

While we await Unicode 5.1, here is another update to UniView that provides a bunch of additional useful features and fixes a few bugs.
Changes include:
- Changed the custom range input to a single field that will accept various range formats. This makes it easier to cut and paste or drag and drop ranges into the input field.
- The numbers must be in hexadecimal form and separated by a colon (the default), a hyphen, one or more spaces, or one or more periods. There must be only two numbers. The numbers can be in the following formats: 1234, &#x1234;, &#1234;, \u1234, U+1234. The actual number of hex digits can be between 1 and 6.
- Added the ability to select whether Search looks at any combination of character names only, other parts of a record in the Unicode database, or the other character description information, and added a message to say how many characters were matched.
- Added the ability to search within the range specified in the field entitled Range.
- Added the ability to list characters with a given General or Bidirectional property (within a specified range or not).
- Added an AJAX link to my database of information about Unicode characters. If enabled, using the DB checkbox, this automatically retrieves any available data for a character when information about that character is displayed in the lower right panel. You can also specify that UniView should open with that set as the default using database=on in the URI used to call UniView.
- Because of the previous improvement, I removed the ability to link in a file of information about characters. (The information in the files was a copy of the information in the database.)
- Moved the Code point(s) and Cut & paste fields lower, to make them easier to use.
- Fixed a bug that was preventing the Search function finding characters in the Basic Latin block.
- Bugfix: a range like 0036:0067 will always show full rows now; a range with start higher than end will show alert.
- Added reference to decodeunicode when graphics are displayed in left column
- Bugfix: search parameter won’t break when graphics etc toggled
- You can now specify windowHeight parameter at startup in the URI’s query string.
Please report any bugs to me, and don’t forget to refresh any UniView files in your cache before using the new version.
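The range rules in the list above (two hex numbers, several possible separators and prefixes, full rows, and an alert when the start is higher than the end) can be sketched as follows. This is my own illustration under stated assumptions, not the code in UniView: parseRange is a hypothetical name, a row is assumed to be 16 code points, decimal forms such as &#1234; are not handled, and errors are thrown rather than shown as alerts.

```javascript
// Hypothetical sketch of parsing a custom range such as "0036:0067" or "U+1234-U+1240".
function parseRange(input) {
  // Strip the escape decorations so only hex digits and separators remain.
  const cleaned = input.replace(/&#x|U\+|\\u/gi, '').replace(/;/g, '').trim();
  // Separators: a colon, a hyphen, one or more spaces, or one or more periods.
  const parts = cleaned.split(/[:\-\s.]+/);
  if (parts.length !== 2 || !parts.every(p => /^[0-9A-Fa-f]{1,6}$/.test(p))) {
    throw new Error('expected exactly two hex numbers of 1 to 6 digits');
  }
  const start = parseInt(parts[0], 16);
  const end = parseInt(parts[1], 16);
  if (start > end) {
    throw new RangeError('range start is higher than range end');
  }
  // Widen to whole rows of 16 so that full rows are always shown (an assumption).
  return { start: start & ~0xF, end: end | 0xF };
}

console.log(parseRange('0036:0067'));
```

With these assumptions, a range like 0036:0067 is widened to 0030 through 006F, so the first and last rows are shown in full.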
Posted on Thu 29 Nov 2007 under general, i18n, web, writings
Tim Greenwood just pointed out to me a ‘bug’ in my converter program, which I think is actually a bug in Firefox (although I imagine it was implemented by someone as a feature).
If you type A0 (the hex code for a non-breaking space) in the Hexadecimal code points field, then press Convert, you will get a blank space in the Characters field that should be U+00A0 NO-BREAK SPACE. Then press Convert or View Names above this Characters field and you’ll find that what was supposed to be a NBSP has changed into an ordinary space. IE7, Opera and Safari all continue to show the character in the field as a NBSP.
(However, all four browsers substitute an ordinary space when you copy and paste the text from the Characters field into something else.)
I tried this with a range of other types of space, but saw no such behaviour (try it). They all remained themselves.
Anyone know what this is about?
Posted on Sun 28 Oct 2007 under general, i18n, utilities, web
>> Use it !

This web-based tool helps you convert between a number of Unicode escape and code formats.
Changes in the new version:
- Convert from JavaScript, Java and C escape notation, and to JavaScript/Java escapes (with switch to show C-style supplementary characters)
- Convert to and from CSS escape notation
- Convert from HTML/XML code with escapes to code with just characters
- Convert < > " or & in HTML/XML code to entities
- Option to show ASCII characters when converting to NCRs
- View a set of characters in UniView by clicking on the View in UniView button
For CSS output I chose the 6-figure version with no optional space, since I thought it was clearest. I’ve had a request to change it to the shortest form (4 or 6 figures) followed by space. If other people prefer that, I may change it.
Update: Markus Scherer convinced me to change the CSS output. So rather than 6-figure escapes with no space, the output now contains 6-figure escapes followed by a space for supplementary characters, and 4-figure escapes followed by a space elsewhere.
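The revised CSS output can be illustrated with a short sketch: 6-figure escapes followed by a space for supplementary characters, and 4-figure escapes followed by a space elsewhere. The function name toCssEscapes is hypothetical, and leaving ASCII characters unescaped is my assumption rather than something stated above.

```javascript
// Hypothetical sketch of the CSS escape output format described above.
function toCssEscapes(text) {
  let out = '';
  for (const ch of text) {          // for...of iterates by code point, not by code unit
    const cp = ch.codePointAt(0);
    if (cp < 0x80) {
      out += ch;                    // assumption: plain ASCII is passed through unchanged
    } else {
      const width = cp > 0xFFFF ? 6 : 4; // 6 figures for supplementary characters, else 4
      out += '\\' + cp.toString(16).toUpperCase().padStart(width, '0') + ' ';
    }
  }
  return out;
}

console.log(toCssEscapes('caf\u00E9'));
```

The trailing space ends the escape, so a following character that happens to be a hex digit is not read as part of the number.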
Posted on Sun 14 Oct 2007 under general, i18n, utilities, web
>> See what it can do !
>> Use it !

I found a little more time to work on UniView while flying to the US for the I18n & Unicode Conference yesterday, adding a bunch of additional useful features.
Changes include:
- Extended the ability to open UniView with data displayed from a URI. In addition to specifying a block and a character, you can now specify a range, a list of codepoints, a list of characters, or a search string. This is useful for pointing people to results using URIs in links or email.
- Switching between graphics or fonts for display of characters now refreshes the right panel also.
- Clicking on the information about the script group of a character displayed in the right panel will cause that block to be displayed in the left panel. This is particularly useful when you find a single character and want to know what’s around it.
- Replaced the use of hyphens to specify block names in URI queries with underscores or %20. This may break some existing URIs, but fixes a bug that meant that block names that actually contain hyphens were not displaying.
- Added an option to the right hand panel to display the current character in the Unicode Conversion tool.
- Fixed some other bugs related to specifying Basic Latin block in a URI.
- Reinstated CJK Unified Ideographs and Hangul Syllables in the block selection pull-down, but added a warning and opt out if the block you are about to display contains more than 2000 characters. Also added warning and opt out if you try to specify a range of over 2000 characters.
Please report any bugs to me, and don’t forget to refresh any UniView files in your cache before using the new version.
Posted on Mon 8 Oct 2007 under general, i18n, utilities, web
>> See what it can do !
>> Use it !

In little pockets of time recently I’ve been making some significant improvements to my UniView tool, the character map on steroids.
Changes include:
- Substantially revised the code so that handling of ideographic and hangul characters and other characters not in the Unidata database is much improved. For example, ideographs now display in the left panel for a specified range and property values are available in the right panel.
- Added regular expression support to the search input field.
- Changes to the user interface: moved highlighting controls to the initial screens and moved others, such as the chart numbering toggle, to the submenu under “Options”; provided wider input fields for codepoint and cut&paste input; replaced the graphics and list toggle icons with checkboxes; provided an icon to quickly clear the contents of the codepoint and cut&paste input fields. A link to the UniHan database was added alongside the Cut & paste input field: when clicked, this icon looks up the first character in either field. A link to the UniHan database was also added to the right panel when a Unified CJK character is displayed there.
- The Codepoint input field now accepts more than one codepoint (separated by spaces).
- When you double-click on a character in the left panel the codepoint is appended to the Codepoint input field as well as adding the character to the Cut & paste field.
- When you click in the checkbox Show as graphics the change is immediately applied to whatever is in the left panel. It no longer redisplays the range if you are looking at, say, a list of characters generated by the Codepoint input, but redisplays the same list.
- Set the default font to “Arial Unicode MS, sans-serif”.
- Added a message for those who do not have JavaScript turned on, and messages to please wait while data is being downloaded on initial startup.
- Fixed the icons linking to the converter tool, so that the contents of the adjacent field are passed to the converter and converted automatically.
- Added links in the right panel to FileFormat pages (in addition to decodeUnicode). The FileFormat pages provide useful information for Java and .Net users about a given character.
- Removed the option to specify your own character notes (I’m not aware that anyone ever did, since it hasn’t worked for a while now and no-one has complained). This is because AJAX technology will not allow an XML file to be included from another domain. When that is fixed I will reinstate it.
- Fixed a number of other bugs, particularly related to supplementary character support and highlighting.
Please report any bugs to me, and don’t forget to refresh any UniView files in your cache before using the new version.
Posted on Thu 27 Sep 2007 under general, i18n, utilities, web
>> Use it !

This web-based tool helps you convert between Unicode character numbers, characters, UTF-8 and UTF-16 code units in hex, percent escapes, Unicode U+hex notation, and Numeric Character References (hex and decimal).
Changes in the new version:
- Convert to and from Unicode U+hex notation
- Get a list of Unicode names for a sequence of characters by clicking on the View Names button
- You now have to click a button to start the conversion, rather than remove focus from the input area. This provides better control and a more intuitive approach.
It also allows you to separate a sequence of characters by spaces. Paste the characters into the Characters field and click Convert. Then click Convert immediately in the Unicode U+hex notation field. (The latter field is the only one that changes the data after an initial conversion.)
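Several of the formats the tool converts between can be derived from a character’s code point in a few lines. The sketch below is an illustration only, not the converter’s code: describeCharacters is a hypothetical name, and it relies on the TextEncoder API, which is available in current browsers and in Node.js.

```javascript
// Hypothetical sketch: show one character in several of the notations listed above.
function describeCharacters(text) {
  const encoder = new TextEncoder(); // always encodes to UTF-8
  return Array.from(text, ch => {    // Array.from splits the string by code point
    const cp = ch.codePointAt(0);
    const hex = cp.toString(16).toUpperCase().padStart(4, '0');
    const utf8 = Array.from(encoder.encode(ch),
      b => b.toString(16).toUpperCase().padStart(2, '0')).join(' ');
    return {
      uPlus: 'U+' + hex,         // Unicode U+hex notation
      ncrHex: '&#x' + hex + ';', // hexadecimal Numeric Character Reference
      ncrDec: '&#' + cp + ';',   // decimal Numeric Character Reference
      utf8: utf8                 // UTF-8 code units in hex
    };
  });
}

console.log(describeCharacters('\u00E9'));
```

Because the input is split by code point, a supplementary character such as U+1D037 is reported once, with a four-byte UTF-8 sequence, rather than as two surrogates.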
Posted on Wed 19 Sep 2007 under general, i18n, web
decodeunicode is a wiki that allows people to make notes on Unicode characters. I link to it from my UniView web app.
They just announced that they now support Unicode 5.0. That means that they have pages for and images of all 98,884 graphic characters (having recently added 45,000 new gifs in three sizes).
They also welcomed their one millionth unique visitor in June.
They improved the links too, so that you can now link easily to a specific character or group. For example:
http://www.decodeunicode.org/en/combining_diacritical_marks or http://www.decodeunicode.org/en/cuneiform
and
http://www.decodeunicode.org/en/u+0e33 or http://www.decodeunicode.org/en/u+1d037
Great!
Posted on Mon 9 Jul 2007 under css, general, i18n, web, writings
Christopher Fynn of the National Library of Bhutan raised an interesting question on the W3C Style and I18n lists. Tibetan emphasis is often achieved using one of two small marks below a Tibetan syllable, a little like Japanese wakiten. The picture shows U+0F35: TIBETAN MARK NGAS BZUNG NYI ZLA in use. The other form is 0F37: TIBETAN MARK NGAS BZUNG SGOR RTAGS.
Chris was arguing that using CSS, rather than Unicode characters, to render these marks could be useful because:
- the mark applies to, and is centred below a whole ‘syllable’ – not just the stack of the syllable – this may be easier to achieve with styling than font positioning where, say, a syllable has an even number of head characters (see examples to the far right in the picture)
- it would make it easier to search for text if these characters were not interspersed in it
- it would allow for flexibility in approaches to the visual style used for emphasis – you would be able to change between using these marks or alternatives such as use of red colour or changes in font size just by changing the CSS style sheet (as we can for English text).
There are potential issues with this approach too. These include things like the fact that the horizontal centring of glyphs within the syllable is not trivial. The vertical placement is also particularly difficult. You will notice from the attached image that the height depends on the depth of the text it falls below. On the other hand, it isn’t easy to achieve this with diacritics either, given the number of possible permutations of characters in a syllable. Such positioning is much more complicated than that of the Japanese wakiten.
A bigger issue may turn out to be that the application for this is fairly limited, and user agent developers have other priorities – at least for commercial applications.
To follow along with, and perhaps contribute to, the discussion follow the thread on the style list or the www-international list.
Posted on Sat 21 Oct 2006 under general, i18n, utilities, web
Start the app
This dynamic HTML app helps you convert between Unicode character numbers, characters, UTF-8 and UTF-16 code units in hex, percent escapes, and Numeric Character References (hex and decimal).
This new version adds some useful things:
- You can now convert to and from percent escaped forms. When converting to percent escapes, characters allowed in URI syntax are not converted. When converting from percent escapes you can only use characters allowed in URIs.
- You can also now convert from a mixture of characters and escapes in the bottom two fields.
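JavaScript’s built-in functions behave in a similar way to the percent-escape conversion described in the list above, so they make a convenient illustration. This sketch uses encodeURI and decodeURIComponent and is not the app’s own code; the sample string is arbitrary.

```javascript
// Built-in functions used for illustration; this is not the converter's own code.
const original = 'a b/\u00E9?x=1';

// encodeURI leaves characters that are allowed in URI syntax (such as / ? =) alone
// and converts everything else to percent-escaped UTF-8 bytes.
const escaped = encodeURI(original);

// decodeURIComponent turns percent escapes back into characters.
const restored = decodeURIComponent(escaped);

console.log(escaped);
console.log(restored === original);
```

Here the space becomes %20 and é becomes %C3%A9, while the slash, question mark and equals sign are left as they are.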
Posted on Wed 18 Oct 2006 under general, i18n, web, writings
I got an email this morning asking for some use cases for the CSS :lang selector. Here are some ideas. This should help content authors understand how using :lang can sometimes be better than other approaches when selecting content for styling. Of course, not all user agents support :lang, and hopefully these use cases will also show how enabling support could be useful.
Use case 1
One of the main cases where I want to use :lang is when I have a page that includes numerous short pieces of text in a different script. Take, for example, my notes on the Myanmar script. In such cases I want to assign a particular font and perhaps font-size, etc, to the numerous Myanmar examples.
It does my head in trying to ensure that I labelled all the myanmar text with class attributes so that I get the right font and colour applied. And it’s frustrating, because all I’m doing is repeating information that’s there already in the lang attribute (and in the xml:lang attribute too, given that this is xhtml).
Adding class="my" everywhere also bulks up the document. Even in this smallish document, it adds over 1K to the page size.
It would make life a lot easier to just include a single CSS rule:
:lang(my) { font-family: myanmar1, sans-serif; color:red; font-size: 130%; }
Use case 2
Suppose you have the following Japanese text in an English document:
<blockquote lang="ja" xml:lang="ja">ワールド・ワイド・ウェッブを<em>世界中</em>に広げましょう</blockquote>
Now suppose you want to apply different emphasis styling to the Japanese text, since italicisation doesn’t work well for ideographic scripts in small font sizes. Let’s suppose we wanted to add the proposed wakiten emphasis style that CSS3 describes. How do you make that happen?
Well, ideally, you’d just add the following rule to your CSS, and all would be taken care of:
em:lang(ja) { font-emphasize: dot before; font-style: normal; }
(“When you encounter an em tag and the language is Japanese, use wakiten and remove the italics.”)
If you’re dealing with IE6, :lang is not supported, and you’d actually have to add a special class to each and every emphasis tag embedded in Japanese text and use a rule such as
em.ja { ... }
How annoying is that!
IE7 CR1 supports the CSS selectors lang |= and lang =. Aha! you might think, problem solved. We can use the following rule:
em[lang |= 'ja'] { ... }
But you’d be wrong. This only works if the language is declared on the em element itself. So you’d still have to go through and add lang="ja" xml:lang="ja" to each em element – even though you have already declared that the whole blockquote is in Japanese!
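The difference between the two selectors can be modelled in a few lines of Python. This is a simplified sketch, not how any browser implements selector matching: `:lang()` is treated as looking at the language inherited from the nearest ancestor that declares one, while `[lang|=...]` looks only at the element's own attribute. The function names and the sample markup are invented for illustration, and matching is done case-insensitively for simplicity.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def lang_matches(value, wanted):
    """Prefix match: 'ja' matches 'ja' and 'ja-JP', but not 'jam'."""
    if value is None:
        return False
    value, wanted = value.lower(), wanted.lower()
    return value == wanted or value.startswith(wanted + "-")


def match_em(root, wanted):
    """Return (text of em elements matched like em:lang(x),
    text of em elements matched like em[lang|='x'])."""
    by_pseudo, by_attr = [], []

    def walk(el, inherited):
        own = el.get("lang")
        if own is None:
            own = el.get(XML_LANG)
        current = own if own is not None else inherited
        if el.tag == "em":
            if lang_matches(current, wanted):   # inherited language counts
                by_pseudo.append(el.text)
            if lang_matches(own, wanted):       # only the element's own attribute
                by_attr.append(el.text)
        for child in el:
            walk(child, current)

    walk(root, None)
    return by_pseudo, by_attr


SAMPLE = (
    '<div lang="en">'
    '<blockquote lang="ja" xml:lang="ja">ワールド<em>世界中</em>に</blockquote>'
    '<p>An <em>English</em> word</p>'
    '</div>'
)

if __name__ == "__main__":
    print(match_em(ET.fromstring(SAMPLE), "ja"))
```

With this sample, the em inside the Japanese blockquote is found by the inherited-language check but not by the attribute check, which is exactly the gap described above.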
Use case 3
This use case is slightly less mainstream, but it is becoming more common with the increase in multilingual blogs and AJAX-powered pages. It applies when you include text in a page that comes from another environment, either by cut & paste or by automatic means, and you don’t have the styling information that was originally associated with it.
Assuming that the text has language attributes, or that you can apply those, you could have a set of default rules in your environment that, say, apply a nastaliq font with a percentage size scaling factor to all text in Urdu, so that it has some styling at least, and is a reasonable size relative to the Latin text.
For example, if I cut and paste some Urdu text into this blog, it could make the difference between seeing this:

and this:
Adding, once, a couple of rules in your blog css that say:
:lang(ur) { font-family: standardMSUrdufont, standardMacUrdufont, standardUnixUrdufont, serif; font-size: 140%; }
em:lang(ur) { font-weight: bold; font-style: normal; }
would be preferable to having to add extra inline markup to the text as you add it to your blog each time.
As a similar example, I just released the latest version of the UniView tool (a kind of web-based Character Map on steroids). It includes a facility that allows you to write your own notes about characters in a separate document and see the relevant notes when looking up a specific character. The information is sucked in using AJAX features. See [1].
At the moment we do not try to incorporate or recognize the other document’s style rules when the notes are displayed in UniView. However, while keeping things simple, it may be useful to allow the UniView user to switch on or off some very general default style rules that apply fonts and/or font sizing to text marked up for a particular language.
As long as the text is marked up for language, such defaults can be applied regardless of what class names or styling appeared in the original document. Of course, :lang would be very useful in this respect.
[1] To see this example
a. open UniView
b. where it says “Select a range to display” select Myanmar
c. click on character 1004 and see the description on the right
d. now click on the icon with a + sign between Notes: and Search string: fields
e. from the menu select Myanmar block and say ok, and dismiss the pop up
f. now click on character 1004 again, and see the notes added to the description on the right – these notes came from an XML file (see the same file served as xhtml)
(Anyone can write such a document, stick it on a server and include its information in UniView. The only requirement is that the notes you want to appear be surrounded by <div class="notes" id="C[hexCodepoint]"></div>. The example above is one such file supplied with UniView.)
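As a rough illustration of that requirement, the following Python sketch pulls the note for one character out of a well-formed XHTML string. It is not UniView's code (UniView does this in the browser). The function name `find_notes` and the sample document are hypothetical, and the sketch assumes the id uses uppercase hex digits, as in C1004.

```python
import xml.etree.ElementTree as ET


def find_notes(xhtml: str, codepoint: int):
    """Return the text of <div class="notes" id="C[hex]"> for a code point,
    or None if the document has no such div.

    Assumption: the id is 'C' followed by uppercase hex, e.g. C1004.
    """
    wanted_id = f"C{codepoint:04X}"
    root = ET.fromstring(xhtml)
    for el in root.iter():
        local_name = el.tag.rsplit("}", 1)[-1]      # drop any XML namespace
        classes = el.get("class", "").split()
        if local_name == "div" and "notes" in classes and el.get("id") == wanted_id:
            return "".join(el.itertext()).strip()
    return None


# Hypothetical notes file, invented for this example
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="notes" id="C1004"><p>Hypothetical note about U+1004.</p></div>
<div class="notes" id="C1005"><p>Another note.</p></div>
</body></html>"""

if __name__ == "__main__":
    print(find_notes(SAMPLE, 0x1004))
```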
Other useful stuff
At the W3C Internationalization site you can find:
- an article that answers the question: “What is the most appropriate way to associate CSS styles with text in a particular language in a multilingual XHTML/HTML document?”
- a set of test pages relating to user agent support of :lang, lang|= and lang= and a fairly recent summary of results