Removed the ‘beta’ from the version number and replaced it with .0.1. The new version now converts u+… (ie. lowercase u) as well as U+… notation.

See http://rishida.net/tools/conversion/

Thanks to Martin Dürst for the suggestion.

>> Use it

Picture of the page in action.

I have added a number of new features to my lookup tool to help with choosing language tags. More information is now available when you look up subtags (such as what to use if the subtag is deprecated, and which subtags a macrolanguage encloses), and there are more tests of well-formedness, with clearer explanations of the problem. Example.

This should make it a lot more useful to people who haven’t read BCP 47 and want to create language tags. Hopefully, in a short while, I’ll also write and link to an article that describes how to use subtags from the ground up in a procedural way, that will complement the tool.

For further assistance, you can now link from a language subtag result to the SIL Ethnologue, to make it easier to check whether that subtag really does refer to the language you were thinking of.

In addition, script subtag results link to Unicode blocks in UniView.

>> Use it

Picture of the page in action.

The IANA Subtag Registry has been recently updated to contain 220 extlang subtags and the ISO 639-3 language subtags, taking the total number of subtags to almost 8,000.

I have produced a new version of my lookup tool to help with language tagging. In addition to helping you find subtags and look up the meaning of subtags, it now helps check the well-formedness of a language tag.

The tool provides access to all currently defined subtags, including the new extlang subtags.

Parsing language tags. In addition to trying to make the user interface more friendly, I also added the ability to parse hyphenated tags and discover their structure and check for errors. I’m not claiming with this release that the new parser field tests all the corner cases, but it should provide reports for most of the typical errors.

It reports errors for the following:

- subtags that are not in the registry (by type)
- incorrectly ordered subtags
- duplicate variant subtags and multiple subtags of other types
- overlong private use subtags
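
The structural checks in this list can be sketched in a few lines of JavaScript. This is a simplified illustration and not the code behind the tool: the names checkLanguageTag and classify are made up for the example, the type of each subtag is guessed purely from its length and content, and the registry is never consulted (so unregistered subtags go unreported). Extensions and grandfathered tags are ignored.

```javascript
// Position of each subtag type in a well-formed tag (after the primary language).
const RANK = { extlang: 1, script: 2, region: 3, variant: 4 };

// Guess the type of a subtag from its shape alone (no registry lookup).
function classify(subtag) {
  if (/^[a-z]{3}$/.test(subtag)) return 'extlang';
  if (/^[a-z]{4}$/.test(subtag)) return 'script';
  if (/^([a-z]{2}|[0-9]{3})$/.test(subtag)) return 'region';
  if (/^([a-z0-9]{5,8}|[0-9][a-z0-9]{3})$/.test(subtag)) return 'variant';
  return null;
}

// Returns a list of error messages; an empty list means no problem was found.
function checkLanguageTag(tag) {
  const errors = [];
  const subtags = tag.toLowerCase().split('-');
  const seen = { extlang: [], script: [], region: [], variant: [] };
  let highestRank = 0;
  let i = 0;

  // A tag may consist of private use subtags only (x-...).
  if (subtags[0] !== 'x') {
    if (!/^[a-z]{2,8}$/.test(subtags[0])) {
      errors.push(`"${subtags[0]}" is not a well-formed primary language subtag`);
    }
    i = 1;
  }

  for (; i < subtags.length; i++) {
    const subtag = subtags[i];
    if (subtag === 'x') {
      // Everything after the x singleton is private use: 1 to 8 letters or digits each.
      const privateUse = subtags.slice(i + 1);
      if (privateUse.length === 0) errors.push('"x" must be followed by at least one subtag');
      for (const p of privateUse) {
        if (!/^[a-z0-9]{1,8}$/.test(p)) {
          errors.push(`private use subtag "${p}" is not 1 to 8 letters or digits`);
        }
      }
      break;
    }
    const type = classify(subtag);
    if (type === null) {
      errors.push(`"${subtag}" does not look like any subtag type handled here`);
      continue;
    }
    if (RANK[type] < highestRank) errors.push(`${type} subtag "${subtag}" is out of order`);
    highestRank = Math.max(highestRank, RANK[type]);
    if (type === 'variant') {
      if (seen.variant.includes(subtag)) errors.push(`duplicate variant subtag "${subtag}"`);
    } else if (seen[type].length > 0) {
      errors.push(`more than one ${type} subtag`);
    }
    seen[type].push(subtag);
  }
  return errors;
}
```

For example, checkLanguageTag('zh-cmn-Hans-CN') returns an empty list, while checkLanguageTag('en-US-Latn') reports that the script subtag is out of order.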

Try this example.

It doesn’t yet handle extensions, but then there aren’t any valid ones to handle yet anyway.

I hope that’s useful.

>> See what it can do

>> Use it

Picture of the page in action.

Following hot on the heels of the last release come some further significant changes to UniView aimed at making it easier to use as Unicode grows.

The big change is that UniView now starts up in graphics mode by default. This means that pages load more slowly, but (especially with the continuing growth of Unicode) also means that you are more likely to be able to see the characters you are looking for. It’s easy to switch between modes at any point, using the “Use graphics” checkbox. (And if you preferred font glyphs as a default, you just need to change the URI in your bookmarked link slightly, and you can continue to work that way.)

To facilitate this change, I created my own graphics for a number of blocks which are not yet covered by decodeunicode, or which are no longer fully covered by decodeunicode. The blocks for which I provided graphics are Latin Extended-C, Latin Extended-D, Latin Extended Additional, Cyrillic Supplement, Cyrillic Extended-B, Modifier Tone Letters, Tibetan, Malayalam, Saurashtra, Ol Chiki, Myanmar, Kayah Li, Cham, Rejang, Vai, Supplemental Punctuation, and Miscellaneous Symbols and Arrows.

There are still many characters for which there are no graphics (especially the new characters in Unicode 5.2), but coverage is much better than it was. As I find more fonts, I will be able to create graphics for the remaining characters.

I also put a grey box around the characters in tables. This is particularly useful if there are no graphics or font glyphs for a block or range of characters, as it makes it easier to locate the character you are looking for.

I also fixed a bug that was preventing Chrome, Safari and IE from displaying the first two Latin blocks. I think the bug was actually in the Unicode data file.

>> See what it can do

>> Use it

Picture of the page in action.

With the family now in Japan, I had some extra time to spare this weekend, so I upgraded UniView to handle all the proposed characters for Unicode 5.2.

While the properties for new and modified characters are still in beta, they are not officially stable. However, the character allocations should be stable at this point. UniView therefore alerts you if you are looking at a new character.

If the Unicode database information has changed for a given character you are also warned, and provided with a link that points to the previous information for that character. These warnings will be removed from UniView when Unicode 5.2 is released.

Of course, you are unlikely to be able to actually see the new characters themselves, unless you are lucky enough to have a very new font to hand. The graphic alternatives are not available yet for these characters. I’m wondering whether it’s possible for me to do something about that, but that will take a little longer. In the meantime, you might find it more useful to view blocks in list view. (Click on ‘Show range as list’).

This release also fixes a few small bugs in the HTML and JavaScript code.

I was in Berlin for Localization World and then Potsdam to talk about Japanese Layout last month.

I didn’t get much time for photos in Berlin. These photos were mostly taken during the dinner cruise. And in Potsdam it poured with rain most of the day, so the photos look a little dark.

I also uploaded a bunch of photos from a trip to Berlin with the family in 2005.


There are 4 new sets of photos:

A new version of this very popular tool is now available, in a new location. Although it is currently labeled ‘beta’, I recommend that you use that instead, and change any links and bookmarks to the new location. There are a number of new features.

There is also a vastly improved code base. If you are one of the many people who have contacted me to ask how I coded the conversions, please take a look at the new JavaScript code. It is much cleaner and more compact.

New features include:

* New mixed input field and position of some fields changed.
* New field for conversion of 0x… notation hex escapes.
* Enabled invisible and ambiguous characters to be made visible in the XML output.
* Added support for all HTML entities in HTML/XML input.
* All code rewritten to use characters as the internal representation, rather than code points. Also, code is much smaller and cleaner, partly through use of regular expression matching.
* Various filters available for conversion, such as allowing ASCII or Latin1 characters to remain unconverted in NCR output.
* New icon to quickly select all contents of a field.
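
To illustrate the filtering idea in the list above, here is a minimal sketch of converting text to hexadecimal numeric character references (NCRs) while optionally leaving ASCII or Latin1 characters alone. It is not the tool's own code: the function name toNCR and the option names 'none', 'ascii' and 'latin1' are assumptions made for the example.

```javascript
// Convert text to hex NCRs, leaving characters at or below a chosen limit untouched.
// keep: 'none' converts everything, 'ascii' keeps U+0000..U+007F,
// 'latin1' keeps U+0000..U+00FF. (Option names are hypothetical.)
function toNCR(text, keep = 'none') {
  const limit = { none: -1, ascii: 0x7F, latin1: 0xFF }[keep];
  let out = '';
  // for...of walks the string by code point, so surrogate pairs stay together.
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    out += cp <= limit ? ch : `&#x${cp.toString(16).toUpperCase()};`;
  }
  return out;
}
```

With the 'ascii' filter, toNCR('a\u00E9\u20AC', 'ascii') gives 'a&#xE9;&#x20AC;', whereas with 'latin1' the é is left as it is.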

There is also a new demonstration feature.

If there are no issues raised/remaining in a couple of months, I’ll remove the beta tag.

>> Use it

Picture of the page in action.

This is a new tool that helps you to locate a country or territory on a map of the world. Ever wondered where Kazakhstan is? This will show you.

The map is in SVG and expands to fill the window. Territories are coloured red. Very small territories are marked by a red dot.

The map comes from Wikipedia. The list of territories comes from the regions listed in the IANA Language Subtag Registry. I can’t guarantee that all the territories in the pulldown list are viewable, but nearly all are.

It’s quite a big SVG file, so it takes a little while to draw. I’ll try to speed that up in the future. It seems to draw much faster on Chrome or Opera than on Firefox or IE.

For the future I have some other ideas, such as displaying the country name natively, and linking to Wikipedia articles, CLDR data, etc. But that’s for later.

Update: Almost every time I located a country, I found myself wondering what the countries alongside are. So now as you move your mouse over a country, the name of that country pops up.

Enjoy.

>> See what it can do !

>> Use it !

Picture of the page in action.

The major changes in this version include a new feature to normalise text as NFC or NFD, the ability to accept decimal code point values, and an overhaul of the top part of the user interface.

Added buttons to the Text area to allow conversion of the text to NFC or NFD normalization forms. (You may not notice the change until you list the characters.)

The control panel was also substantially rearranged again to hopefully make it easier for newcomers to see what they can do.

The Code point conversion feature was upgraded to handle decimal code point values.

A single character in the codepoints area or text area is now listed in the lower left panel when you click on the icon, rather than in the right-hand properties panel. This is to improve consistency and avoid surprises.

Added a link to the CLDR property demo from the right panel to give access to additional properties.

Improved the parsing of codepoints when surrounded by text in the Code point input field, so that it now works with &#x…; and \u… and \U… escapes.
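
As an illustration of pulling escapes out of surrounding text, the following sketch finds &#x…; references and \u… and \U… escapes and returns their code point values. It is not UniView's code. The function name is made up, and the assumption that \u is followed by exactly four hex digits and \U by exactly eight is mine.

```javascript
// Find &#x…; \uXXXX and \UXXXXXXXX escapes in arbitrary text and
// return the code points they represent, in order of appearance.
function codePointsFromEscapes(input) {
  const pattern = /&#x([0-9A-Fa-f]{1,6});|\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})/g;
  const result = [];
  for (const m of input.matchAll(pattern)) {
    // Exactly one of the three capture groups is set for each match.
    result.push(parseInt(m[1] || m[2] || m[3], 16));
  }
  return result;
}
```

Given the text 'x &#x4E2D; and \u00E9 then \U0001D11E' (with literal backslashes), this returns the values 0x4E2D, 0xE9 and 0x1D11E.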

Jettisoned some unneeded code to reduce download by around 40-50K bytes. Implemented the NFC/NFD feature using AJAX, to avoid putting the download size back up.

When you delete the contents of the text area or the code point area, the associated input field is given focus, so you are ready for input.

A couple more minor bug fixes.

I was asked to make available the code for my normalization functions in JavaScript and PHP. The links are below. I’m making the code available under a Creative Commons Attribution-Noncommercial-Share Alike licence.

Disclaimers. Note that I make no claim to have produced polished, compact or well-optimised code! The code does what I need, and I’m happy with that. You are welcome to suggest improvements, and I’m sure there are many that could be made.

As they say, this code is made available in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

The code is a little more convoluted than it ought to be, to get around the fact that JavaScript doesn’t understand supplementary characters, and PHP just doesn’t naturally understand Unicode. (How I long for PHP6.)
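
To show what working around supplementary characters involves, here is a sketch of the surrogate arithmetic that JavaScript code has to do by hand when strings are treated as sequences of 16-bit code units. The function names are hypothetical and this is not taken from the routines linked below. (Later versions of JavaScript added String.fromCodePoint and codePointAt, which do this for you.)

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into its
// high and low surrogate code units.
function toSurrogatePair(cp) {
  if (cp < 0x10000 || cp > 0x10FFFF) {
    throw new RangeError('not a supplementary code point: ' + cp.toString(16));
  }
  const offset = cp - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

// Combine a high and a low surrogate back into a single code point.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}
```

For U+1D11E (MUSICAL SYMBOL G CLEF) the pair is D834 DD1E.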

Update: I meant to mention that there is a way of doing normalization in PHP already. I made this code available just because I had it. I created it as a learning exercise. It may be useful, however, if you are unable to load the ICU and intl packages onto your server.

To use the code, simply call nfc('your-text-string') or nfd('your-text-string') from your code and capture the result.

For PHP you’ll need these routines and this data.

For JavaScript look at these routines and this data. There is also a lite version of the data file that doesn’t include Han characters. I use this sometimes for bandwidth savings (about 14K less).

Test files. I also created some test files for PHP and for JavaScript. Both of these expect to find a copy of http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt in the local directory. These files run 71,076 tests.

Cautions. Be careful about the editor you use for the data files. I spent several hours fruitlessly debugging the routines, only to find that Notepad++ was displaying certain supplementary characters ok, but corrupting them on save. I switched to Notepad and the problem evaporated. And I probably don’t need to add that editing the data files in something like DreamWeaver is a bad idea because it will probably normalize the data before saving.

Another point: you may see Unicode replacement characters at a couple of points in the PHP source. These represent the first and last characters in the high surrogate range.

Experimenting. If you want to play with something that uses this you could try my Tłįchǫ (Dogrib) character picker, or my Normalizer tool. I will slowly fit this to all the pickers and to UniView. I have a local version of UniView waiting in the wings that uses the PHP files via AJAX, to reduce download size. For that you need a file that returns the result as plain text across the wire, such as this.

Well, I hope that that may be of use to someone, somewhere. I hope I haven’t forgotten anything.

>> Try it !

Picture of the page in action.

This tool allows you to normalise short pieces of text to Unicode forms NFC or NFD. You can paste the relevant text into a text area, or append it to the URI that calls the page, eg. Vietnamese example.

Note that, although I spell normalisation in the British way in this post, the URI uses the American spelling, since I suspect most users of the tool will expect it to be spelt that way.

Wondering what normalisation is? In Unicode a letter like á can be represented by a (precomposed) single character or by an a followed by an acute accent (a decomposed sequence). Unicode regards these two representations as formally equivalent. If you are comparing strings, therefore, you need to know which representations are equivalent. Usually you would want to normalise your text prior to comparison to a given normalisation form, so that the comparison process can be efficient. Unicode defines four normalization forms, two of which, NFD and NFC, are handled by this tool.

Basically NFD reduces all precomposed characters to their decomposed equivalents, whereas NFC uses precomposed characters for most common situations.
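
The equivalence described above is easy to see in code. The sketch below uses String.prototype.normalize, which is built into current JavaScript engines. It is shown only to illustrate the concept and is not how the tool itself is implemented. The helper name equalAfterNormalisation is made up for the example.

```javascript
const precomposed = '\u00E1';   // á as a single precomposed character
const decomposed = 'a\u0301';   // a followed by COMBINING ACUTE ACCENT

// Compare two strings after bringing both to the same normalisation form.
function equalAfterNormalisation(a, b, form = 'NFC') {
  return a.normalize(form) === b.normalize(form);
}

console.log(precomposed === decomposed);                      // false: different code points
console.log(equalAfterNormalisation(precomposed, decomposed)); // true: canonically equivalent
console.log(precomposed.normalize('NFD') === decomposed);      // true: NFD decomposes
console.log(decomposed.normalize('NFC') === precomposed);      // true: NFC recomposes
```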

>> See what it can do !

>> Use it !

Picture of the page in action.

The major changes in this version relate to the way searching and property-based lookup is done on characters in the lower left panel, and features for refining and capturing the resulting lists.

Removed the two Highlight selection boxes. These used to highlight characters in the lower left panel with a specific property value. The Show selection box on the left (used to be Show list) now does that job if you set the Local checkbox alongside it. (Local is the default for this feature.)

As part of that move, the former SiR (search in range) checkbox that used to be alongside Custom range has been moved below the Search for input field, and renamed to Local. If Local is checked, searching can now be done on any content in the lower left panel, and the results are shown as highlighting, rather than a new list.

To complement these new highlighting capabilities, a new feature was added. If you click on the icon next to Make list from highlights the content of the lower left panel will be replaced by a list of just those items that are currently highlighted – whether the highlighting results from a search or a property listing. Note that this can also be useful to refine searches: perform an initial search, convert the result to a list, then perform another search on that list, and so on.

Finally got around to putting icons after the pull-down lists. This means that if you want to reapply, say, a block selection after doing something else, only one click is needed (rather than having to choose another option, then choose the original option). The effect of this on the ease of use of UniView is much greater than I expected.

Added an icon to the text area. If you click on this, all the characters in the lower left panel are copied into the text area. This is very useful for capturing the result of a search, or even a whole block. Note that if a list in the lower left panel contains unassigned code points, these are not copied to the text area.

As a result of the above changes, the way Show as graphics and Show range as list work internally was essentially rewritten, but users shouldn’t see the difference.

Changed the label Character area to Text area.

>> See what it can do !

>> Use it !

Picture of the page in action.

The main change in this version is the reworking of the former Cut & paste and Code point(s) fields to make it easier to use UniView as a generalised picker.

Moved the cut&paste field downwards, made it larger, and changed the label to character area. This should make it easier to deal with text copy/cut & paste, and more obvious that that is possible with UniView. It is much clearer now that UniView provides character map/picker functionality, and not just character lookup.

Whereas previously you had to double-click to put a character in the lower left pane into the Cut&paste field, UniView now echoes characters to the Character area every time you (single) click on a character in the lower left hand pane. This can be turned off. Double-clicking will still add the codepoint of a character in the lower left panel to the Code points field.

The Character area has its own set of icons, some of which are new: ie. you can select the text, add a space, and change the font of the text in the area (as well as turn the echo on and off). I also spruced up the icons on the UI in general.

Note that on most browsers you can insert characters at the point in the Character area where you set the cursor, or you can overwrite a highlighted range of characters, whereas (because of the non-standard way it handles selections and ranges) Internet Explorer will always add characters to the end of the line.

The Code points field has also been enlarged, and I moved the Show list pull-down to the left and Show as graphics and Show page as list to the right. This puts all the main commands for creating lists together on the left.

When you mouse over a character in the lower left pane you now see both hex and decimal codepoint information. (Previously you just saw an unlabelled decimal number.) You will also find decimal code point values for characters displayed in the lower right panel.

Fixed a bug in the Code points input feature so that trailing spaces no longer produce errors, but also went much further than that. You can now add random text containing codepoints or most types of hex-based escaped characters to the input field, and UniView will seek them out to create the list. For example, if you paste the following into the Code points field:

the decomposition mapping is <U+CE20, U+11B8>, and not <U+110E, U+1173, U+11B8>.

the result will be:

CE20: 츠 [Hangul Syllables]
11B8: ᆸ HANGUL JONGSEONG PIEUP
110E: ᄎ HANGUL CHOSEONG CHIEUCH
1173: ᅳ HANGUL JUNGSEONG EU
11B8: ᆸ HANGUL JONGSEONG PIEUP

Of course, UniView is not able to tell that an ordinary word like ‘Abba’ is not a hex codepoint, so you obviously need to watch out for that and a few other situations, but much of the time this should make it much easier to extract codepoint information.
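
One way to get this kind of behaviour is a regular expression that looks for whole words made up only of hex digits, optionally preceded by U+. The sketch below is a guess at the approach and not UniView's actual code. The function name and the choice of 2 to 6 hex digits are assumptions.

```javascript
// Pull anything that looks like a hex code point out of arbitrary text.
// A candidate is a whole word of 2 to 6 hex digits, optionally preceded by U+ or u+.
function findCodePoints(text) {
  const pattern = /\b(?:U\+)?([0-9A-F]{2,6})\b/gi;
  const found = [];
  for (const match of text.matchAll(pattern)) {
    // Discard values beyond the end of the Unicode code space.
    if (parseInt(match[1], 16) <= 0x10FFFF) found.push(match[1].toUpperCase());
  }
  return found;
}
```

Run on the sentence quoted above, this yields CE20, 11B8, 110E, 1173 and 11B8. It also shows the caveat: findCodePoints('Abba') returns ABBA, because nothing distinguishes that word from a hex number.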

I still haven’t found a way to fix the display bug in Safari and Google Chrome that causes initial content in the lower left pane to be only partially displayed.

Here are some lists of characters that are useful for normalization. I’ll probably add some others later.

The lists apply to Unicode version 5.1.

The files below contain declarations for JavaScript sparse arrays. They are easy enough to convert to other formats using global search and replace. The verbose version provides character names and code points.

Combining characters with non-zero properties

Characters with non-zero combining properties are assigned to a sparse array indexed by codepoint. The value gives the combining property value.

http://rishida.net/code/normalization/nonzerocombiningchars.txt

http://rishida.net/code/normalization/nonzerocombiningchars-verbose.txt

There are 498 of these.
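
As a sketch of how such a sparse array might be used, the following code puts a sequence of code points into canonical order, which is one of the steps in normalization. The array name combiningClass and its three entries are a hypothetical excerpt written for this example. The declarations in the files above may be named and laid out differently.

```javascript
// Hypothetical excerpt of a sparse array indexed by code point.
const combiningClass = [];
combiningClass[0x0301] = 230; // COMBINING ACUTE ACCENT
combiningClass[0x0323] = 220; // COMBINING DOT BELOW
combiningClass[0x0327] = 202; // COMBINING CEDILLA

// Canonical ordering: within each run of combining marks, sort by
// combining class, never moving a mark across a class-zero character.
function canonicalOrder(codePoints) {
  const out = codePoints.slice();
  for (let i = 1; i < out.length; i++) {
    const cc = combiningClass[out[i]] || 0;
    if (cc === 0) continue; // class zero characters never move
    let j = i;
    while (j > 0) {
      const prev = combiningClass[out[j - 1]] || 0;
      if (prev <= cc) break; // stops at class zero, and keeps equal classes in order
      [out[j - 1], out[j]] = [out[j], out[j - 1]];
      j--;
    }
  }
  return out;
}
```

For instance, a + acute + dot below (0061 0301 0323) is reordered to 0061 0323 0301, because the dot below has the lower combining class.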

Canonically decomposable characters for NFD

This list maps single characters to their decompositions. The single character is referenced by an index into the array, and the value for that index is the decomposed characters.

http://rishida.net/code/normalization/canonicaldecomposables.txt

http://rishida.net/code/normalization/canonicaldecomposables-verbose.txt

There are 2042 of these characters.
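
A decomposition list of this kind could be applied along the following lines. This is an illustrative sketch only: the array name decomposition and its three entries are hypothetical, values are stored here as arrays of code points, and the recursion is included because I am assuming the mapped values may themselves be decomposable. Full NFD also needs canonical reordering of combining marks and the algorithmic decomposition of Hangul syllables, neither of which is shown.

```javascript
// Hypothetical excerpt: index = code point, value = its canonical decomposition.
const decomposition = [];
decomposition[0x00E1] = [0x0061, 0x0301]; // á -> a + acute
decomposition[0x00E2] = [0x0061, 0x0302]; // â -> a + circumflex
decomposition[0x1EA5] = [0x00E2, 0x0301]; // ấ -> â + acute

// Fully decompose one code point, following mappings until none apply.
function decompose(cp) {
  const mapped = decomposition[cp];
  if (mapped === undefined) return [cp];
  return mapped.flatMap((part) => decompose(part));
}
```

Here decompose(0x1EA5) expands in two steps to 0061 0302 0301.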

The following code converts a hex codepoint to a sequence of bytes that represent the Unicode codepoint in UTF-8.

This is useful because PHP’s chr() function only works on ASCII :(.

function cp2utf8 ($hexcp) {
	// Convert a hex code point string (eg. '20AC') to its UTF-8 byte sequence.
	$outputString = '';
	$n = hexdec($hexcp);
	if ($n <= 0x7F) {
		// 1 byte: 0xxxxxxx
		$outputString .= chr($n);
		}
	else if ($n <= 0x7FF) {
		// 2 bytes: 110xxxxx 10xxxxxx
		$outputString .= chr(0xC0 | (($n>>6) & 0x1F))
		.chr(0x80 | ($n & 0x3F));
		}
	else if ($n <= 0xFFFF) {
		// 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
		$outputString .= chr(0xE0 | (($n>>12) & 0x0F))
		.chr(0x80 | (($n>>6) & 0x3F))
		.chr(0x80 | ($n & 0x3F));
		}
	else if ($n <= 0x10FFFF) {
		// 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		$outputString .= chr(0xF0 | (($n>>18) & 0x07))
		.chr(0x80 | (($n>>12) & 0x3F)).chr(0x80 | (($n>>6) & 0x3F))
		.chr(0x80 | ($n & 0x3F));
		}
	else {
		// Beyond U+10FFFF: not a Unicode code point. (PHP concatenates with '.', not '+'.)
		$outputString .= 'Error: ' . $n . ' not recognised!';
		}
	return $outputString;
	}

>> Use it !

Picture of the page in action.

I have just upgraded the Malayalam picker to level 7, and added a bunch of new features that should show up in other pickers at level 7 as I get time:

Shape view. The pickers are aimed particularly at people who are not familiar enough with a script to use the keyboard. However, there are many ligatures and conjuncts in Malayalam, which makes it difficult to identify the character sequences needed. This view provides most of the shapes you’ll see in Malayalam text, grouped by shape. It’s something I’ve been wanting to add to the pickers for some time.

Picture of the page in action.

Phonic view. This has been done in other pickers, but it has some new features over those. The sounds have been arranged along similar lines to a standard IPA chart, and multiple transcriptions are supported. In addition, you can click on the transcription text to build up a phonemic string in IPA. This is particularly useful for creating examples.

Picture of the page in action.

Regular expressions in searches. The search feature was upgraded to allow for regular expressions. So now you can highlight characters containing GA without highlighting ones containing NGA: just search for \bga\b (or use the convenient short-cut form .ga.). Of course you can do more complicated searches too.
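
The difference that the word boundaries make can be seen with a small sketch. The three character names are real Unicode names, but the function searchNames is made up for the example and is not the picker's code. The .ga. short-cut is specific to the picker and is not reproduced here.

```javascript
const names = [
  'MALAYALAM LETTER GA',
  'MALAYALAM LETTER GHA',
  'MALAYALAM LETTER NGA',
];

// Return the names matching a case-insensitive regular expression.
function searchNames(list, pattern) {
  const re = new RegExp(pattern, 'i');
  return list.filter((name) => re.test(name));
}

console.log(searchNames(names, 'ga'));       // matches both GA and NGA
console.log(searchNames(names, '\\bga\\b')); // matches GA only
```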

Add codepoint. You can add a hex codepoint value to the box in the yellow area to insert into the text. This is useful for things like the odd unusual character, or for just figuring out what a sequence of codepoints represents. You can input any number of codepoints (including surrogates) into the input box, separating them by spaces.
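
A minimal sketch of turning such a space-separated list into text follows. It is not the picker's own code, and the function name hexListToText is an assumption. It relies on the fact that a high surrogate followed by a low surrogate in a JavaScript string forms a single supplementary character.

```javascript
// Convert a space-separated list of hex code points (surrogates allowed) to a string.
function hexListToText(input) {
  return input
    .trim()
    .split(/\s+/)
    .filter(Boolean) // an empty input yields no items
    .map((hex) => {
      if (!/^[0-9A-Fa-f]{1,6}$/.test(hex) || parseInt(hex, 16) > 0x10FFFF) {
        throw new RangeError('not a hex code point: ' + hex);
      }
      return String.fromCodePoint(parseInt(hex, 16));
    })
    .join('');
}
```

Both hexListToText('1D11E') and hexListToText('D834 DD1E') produce the same single character.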

Chillus. This version of the picker supports all Unicode 5.1 characters, including the chillu characters. Because most Malayalam fonts support the old way of inputting chillu forms, you can specify in the yellow box area what you want the output to be when clicking on a chillu letter: the pre-5.1 sequence or the new atomic character. (The default is the atomic character.)

The picker also comes with the usual set of level 7 features, such as font grid view, graphic characters, hiding of uncommon characters, optimised ordering of characters in the alphabetic view, two-tone highlighting, etc.

You can start up directly in any of the available views by appending ?view= to your URI, followed by one of alphabet, shape, phonic or fontgrid.
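
Reading such a parameter is straightforward with the standard URL API, as the sketch below shows. The function name, the example address and the choice of alphabet as the fallback view are all assumptions made for illustration. The picker's own start-up code may work differently.

```javascript
const VIEWS = ['alphabet', 'shape', 'phonic', 'fontgrid'];

// Work out which view to start in from the ?view= parameter of a URI.
// The fallback value is an assumption, not the picker's documented default.
function initialView(uri, fallback = 'alphabet') {
  const view = new URL(uri).searchParams.get('view');
  return VIEWS.includes(view) ? view : fallback;
}

console.log(initialView('http://example.org/picker?view=shape')); // shape
```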

Enjoy.

>> See what it can do !

>> Use it !

Picture of the page in action.

A large amount of code was rewritten to enable data to be downloaded from the server via AJAX at the point of need. This eliminates the long wait when you start to use UniView without the database information in your cache. This means that there is a slightly longer delay when you view a new block, but the code is designed so that if you have already downloaded data, you don’t have to retrieve it again from the server.
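
The download-once behaviour can be sketched as a small cache wrapped around whatever function fetches the data. This is an illustration of the idea and not UniView's code: createBlockLoader and fetchBlock are hypothetical names, and fetchBlock stands for any function that retrieves the data for a named block from the server.

```javascript
// Wrap a fetching function so that each block is requested at most once.
function createBlockLoader(fetchBlock) {
  const cache = new Map(); // block name -> promise of that block's data
  return function loadBlock(blockName) {
    if (!cache.has(blockName)) {
      // First request for this block: go to the server and remember the promise.
      cache.set(blockName, Promise.resolve(fetchBlock(blockName)));
    }
    // Later requests reuse the stored promise without another round trip.
    return cache.get(blockName);
  };
}
```

The first call to loadBlock('Tibetan') triggers a request, and every later call for the same block returns the stored result.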

The search mechanism was also rewritten. The regular expressions used must now be supported in both JavaScript and PHP (PHP is used if not searching within the current range). When ‘other’ is ticked, the search will look in the alternative name fields, but not in other property settings, so you can no longer use something like ;AL; to search for characters with a particular property. (Use ‘Show list’ instead.)

Removed several zero-width space characters from the code, which means that UniView now works with Google Chrome, except for some annoying display bugs that I’m not sure how to fix – for example, the first time you try to display any block you only seem to get the top line (although, if you click or drag the mouse, the block is actually there). This seems to be WebKit related, since it happens in Safari, too.

Please report any bugs to me, and don’t forget to refresh any UniView files in your cache before using the new version.

>> Read it !

Picture of the page in action.

I finally got to the point, after many long early morning hours, where I felt I could remove the ‘Draft’ from the heading of my Myanmar (Burmese) script notes.

This page is the result of my explorations into how the Myanmar script is used for the Burmese language in the context of the Unicode Myanmar block. It takes into account the significant changes introduced in Unicode version 5.1 in April of this year.

Btw, if you have JavaScript running you can get a list of characters in the examples by mousing over them. If you don’t have JS, you can link to the same information.

There’s also a PDF version, if you don’t want to install the (free) fonts pointed to for the examples.

Here is a summary of the script:

Myanmar is a tonal language and is syllable-based. The script is an abugida, ie. consonants carry an inherent vowel sound that is overridden using vowel signs.

Spaces are used to separate phrases, rather than words. Words can be separated with ZWSP to allow for easy wrapping of text.

Words are composed of syllables. These start with a consonant or initial vowel. An initial consonant may be followed by a medial consonant, which adds the sound j or w. After the vowel, a syllable may end with a nasalisation of the vowel or an unreleased glottal stop, though these final sounds can be represented by various different consonant symbols.

At the end of a syllable a final consonant usually has an ‘asat’ sign above it, to show that there is no inherent vowel.

In multisyllabic words derived from an Indian language such as Pali, where two consonants occur internally with no intervening vowel, the consonants tend to be stacked vertically, and the asat sign is not used.

Text runs from left to right.

There is a set of Myanmar numerals, which are used just like Latin digits.

So, what next? I’m quite keen to get to Mongolian. That looks really complicated. But I’ve been telling myself for a while that I ought to look at Malayalam or Tamil, so I think I’ll try Malayalam.

>> Use it !

Picture of the page in action.

I have just upgraded the Burmese picker as follows:

Rearranged characters. The Myanmar3 font expects multiple combining characters to be entered in the order described in the Unicode 5.1 Standard for correct display. The panel of combining characters has been arranged so that you can easily remember what that order is. Characters to the left precede those to the right, and characters higher up precede those lower down.

In addition to that, I have rearranged all the character positions so that it is easier to locate the various parts of a syllable as you type.

I also added some combinations of characters that make up multi-part vowels and the kinzi with a single click.

I have also moved some of the less common characters to an ‘advanced’ area to the right which can be opened and closed by clicking on the arrow-head icon.

New highlighting. As you mouse over a character the picker will show you other characters that are visually similar (particularly useful for those not very familiar with the script). This new version shows the more likely confusable characters with a blue outline, and other similar characters with orange. This is useful given that many Myanmar characters look quite similar.

As always, you can turn off this feature or disable it in the URI you use to call the picker.

Font grid view. Shows characters in Unicode order, using whatever font is specified in the Font list or Custom font input fields. This allows comparison of fonts (especially useful in IE, which shows if a glyph is missing from a font).

You can start up directly in either of the available views by appending ?view= to your URI, followed by alphabet or fontgrid.

Enjoy.

I’m sitting here watching a video of Timbl talking on a BBC news page and I suddenly realised how good this was.

The page design helps give the impression – there are no clunky boxes around the video itself – but there’s also no need to view in a different area, or switch to another tool, or even wait for a download to get started – it’s just there as part of the page, but a part that moves and produces sound. Kind of like the moving paper in Harry Potter’s world.

It’s great how technology marches on sometimes.

[Update: Since I wrote the above the video has acquired grey panels around the edges for controls, which I think is a shame. It's still pretty good technology though. ]
