Search Results for 'picker'
Posted on Wed 4 Feb 2009 under code notes, general, i18n, utilities, web
I was asked to make available the code for my normalization functions in JavaScript and PHP. The links are below. I’m making the code available under a Creative Commons Attribution-Noncommercial-Share Alike licence.
Disclaimers Note that I make no claim to have produced polished, compact or well-optimised code! The code does what I need, and I’m happy with that. You are welcome to suggest improvements, and I’m sure there are many that could be made.
As they say, this code is made available in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.
The code is a little more convoluted than it ought to be, to get around the fact that JavaScript doesn’t understand supplementary characters, and PHP just doesn’t naturally understand Unicode. (How I long for PHP6.)
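To illustrate the supplementary character issue: JavaScript strings are sequences of UTF-16 code units, so a character above U+FFFF occupies two units (a surrogate pair). The following is a minimal sketch of the arithmetic that such code ends up doing by hand. The function names are hypothetical and are not taken from the routines linked in this post.

```javascript
// Hypothetical helpers for illustration only.

// Convert a supplementary code point (above U+FFFF) to its surrogate pair.
function toSurrogates(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Combine a high and a low surrogate back into one code point.
function fromSurrogates(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

const pair = toSurrogates(0x2000B);
console.log(pair.map(unit => unit.toString(16)));          // [ 'd840', 'dc0b' ]
console.log(fromSurrogates(pair[0], pair[1]).toString(16)); // 2000b
```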
Update: I meant to mention that there is a way of doing normalization in PHP already. I made this code available just because I had it. I created it as a learning exercise. It may be useful, however, if you are unable to load the ICU and intl packages onto your server.
To use the code, simply call nfc('your-text-string') or nfd('your-text-string') from your code and capture the result.
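A usage sketch of that calling pattern follows. So that it runs on its own, nfc and nfd are defined here as stand-ins that wrap the built-in String.prototype.normalize found in current JavaScript engines. That is an assumption for the sake of the example, not the implementation in the linked routines.

```javascript
// Stand-ins so the sketch is self-contained: these wrap the built-in
// normalize method rather than the routines linked from this post.
function nfc(text) { return text.normalize('NFC'); }
function nfd(text) { return text.normalize('NFD'); }

const decomposed = 'e\u0301';       // e + COMBINING ACUTE ACCENT
const composed = nfc(decomposed);   // capture the result

console.log(composed === '\u00E9'); // true: one precomposed character
console.log(nfd(composed).length);  // 2: base letter plus combining mark
```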
For PHP you’ll need these routines and this data.
For JavaScript look at these routines and this data. There is also a lite version of the data file that doesn’t include Han characters. I use this sometimes for bandwidth savings (about 14K less).
Test files I also created some test files for PHP and for JavaScript.
Both of these expect to find a copy of http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt in the local directory. These files run 71,076 tests.
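For readers who want a feel for what such a test run involves, here is a minimal sketch of checking one data line of NormalizationTest.txt. Each data line holds five semicolon-separated fields (c1 to c5) of space-separated hex code points, and the file header states the invariants that NFC and NFD must satisfy. The function names are hypothetical, the built-in normalize method stands in for the routines under test, and the sketch ignores the comment and part-header lines that the real file also contains.

```javascript
// Turn a field such as "0044 0307" into the string it describes.
function fieldToString(field) {
  return field.trim().split(/\s+/)
    .map(hex => String.fromCodePoint(parseInt(hex, 16)))
    .join('');
}

// Check the NFC and NFD invariants for one data line.
// nfc and nfd are passed in so that any implementation can be tested.
function checkLine(line, nfc, nfd) {
  const data = line.split('#')[0];  // drop the trailing comment
  const [c1, c2, c3, c4, c5] = data.split(';').slice(0, 5).map(fieldToString);
  const nfcOk = [c1, c2, c3].every(c => nfc(c) === c2) &&
                [c4, c5].every(c => nfc(c) === c4);
  const nfdOk = [c1, c2, c3].every(c => nfd(c) === c3) &&
                [c4, c5].every(c => nfd(c) === c5);
  return nfcOk && nfdOk;
}

const sample = '1E0A;1E0A;0044 0307;1E0A;0044 0307; # sample line';
console.log(checkLine(sample, s => s.normalize('NFC'), s => s.normalize('NFD'))); // true
```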
Cautions Be careful about the editor you use for the data files. I spent several hours fruitlessly debugging the routines, only to find that Notepad++ was displaying certain supplementary characters ok, but corrupting them on save. I switched to Notepad and the problem evaporated. And I probably don’t need to add that editing the data files in something like DreamWeaver is a bad idea because it will probably normalize the data before saving.
Another point: you may see Unicode replacement characters at a couple of points in the PHP source. These represent the first and last characters in the high surrogate range.
Experimenting If you want to play with something that uses this you could try my Tłįchǫ (Dogrib) character picker, or my Normalizer tool. I will slowly fit this to all the pickers and to UniView. I have a local version of UniView waiting in the wings that uses the PHP files via AJAX, to reduce download size. For that you need a file that returns the result as plain text across the wire, such as this.
Well, I hope that that may be of use to someone, somewhere. I hope I haven’t forgotten anything.
Posted on Wed 7 Jan 2009 under general, i18n, utilities, web
>> See what it can do !
>> Use it !

The main change in this version is the reworking of the former Cut & paste and Code point(s) fields to make it easier to use UniView as a generalised picker.
Moved the cut&paste field downwards, made it larger, and changed the label to character area. This should make it easier to deal with text copy/cut & paste, and more obvious that that is possible with UniView. It is much clearer now that UniView provides character map/picker functionality, and not just character lookup.
Whereas previously you had to double-click to put a character in the lower left pane into the Cut&paste field, UniView now echoes characters to the Character area every time you (single) click on a character in the lower left hand pane. This can be turned off. Double-clicking will still add the codepoint of a character in the lower left panel to the Code points field.
The Character area has its own set of icons, some of which are new: ie. you can select the text, add a space, and change the font of the text in the area (as well as turn the echo on and off). I also spruced up the icons on the UI in general.
Note that on most browsers you can insert characters at the point in the Character area where you set the cursor, or you can overwrite a highlight range of characters, whereas (because of the non-standard way it handles selections and ranges) Internet Explorer will always add characters to the end of the line.
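The insert-or-overwrite behaviour can be sketched as a pure string operation. This is an illustration under assumptions, not UniView's code: the function name is hypothetical, and in a browser the start and end offsets would typically come from the selectionStart and selectionEnd properties of the text field.

```javascript
// Replace the selected range [start, end) of text with the inserted string.
// When start === end this is a plain insertion at the cursor.
function insertAtSelection(text, start, end, insert) {
  return {
    text: text.slice(0, start) + insert + text.slice(end),
    cursor: start + insert.length,  // where the caret should sit afterwards
  };
}

console.log(insertAtSelection('abcd', 2, 2, 'X'));  // text 'abXcd', cursor 3
console.log(insertAtSelection('abcd', 1, 3, 'X'));  // text 'aXd', cursor 2
```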
The Code points field has also been enlarged, and I moved the Show list pull-down to the left and Show as graphics and Show page as list to the right. This puts all the main commands for creating lists together on the left.
When you mouse over a character in the lower left pane you now see both hex and decimal codepoint information. (Previously you just saw an unlabelled decimal number.) You will also find decimal code point values for characters displayed in the lower right panel.
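As a small illustration of the hex and decimal values involved, here is a sketch that produces both for a character. The function name and the label format are made up for this example and are not UniView's.

```javascript
// Return a label with the hex and decimal value of the first code point.
function describeCodePoint(ch) {
  const cp = ch.codePointAt(0);
  const hex = cp.toString(16).toUpperCase().padStart(4, '0');
  return 'U+' + hex + ' (decimal ' + cp + ')';
}

console.log(describeCodePoint('\u0D15'));  // U+0D15 (decimal 3349)
```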
Fixed a bug in the Code points input feature so that trailing spaces no longer produce errors, but also went much further than that. You can now add random text containing codepoints or most types of hex-based escaped characters to the input field, and UniView will seek them out to create the list. For example, if you paste the following into the Code points field:
the decomposition mapping is <U+CE20, U+11B8>, and not <U+110E, U+1173, U+11B8>.
the result will be:
CE20: 츠 [Hangul Syllables]
11B8: ᆸ HANGUL JONGSEONG PIEUP
110E: ᄎ HANGUL CHOSEONG CHIEUCH
1173: ᅳ HANGUL JUNGSEONG EU
11B8: ᆸ HANGUL JONGSEONG PIEUP
Of course, UniView is not able to tell that an ordinary word like ‘Abba’ is not a hex codepoint, so you obviously need to watch out for that and a few other situations, but much of the time this should make it much easier to extract codepoint information.
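One way to approximate this kind of extraction is a regular expression that picks out stand-alone runs of 4 to 6 hex digits. The sketch below is an assumption about the general approach, not UniView's actual code, and it does not try to cover every escape format. It also shows the ‘Abba’ false positive mentioned above.

```javascript
// Find stand-alone runs of 4 to 6 hex digits in arbitrary text.
function findCodePoints(text) {
  const matches = text.match(/\b[0-9A-Fa-f]{4,6}\b/g) || [];
  return matches.map(m => m.toUpperCase());
}

const pasted = 'the decomposition mapping is <U+CE20, U+11B8>, ' +
               'and not <U+110E, U+1173, U+11B8>.';
console.log(findCodePoints(pasted));  // [ 'CE20', '11B8', '110E', '1173', '11B8' ]
console.log(findCodePoints('Abba'));  // [ 'ABBA' ], a false positive
```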
I still haven’t found a way to fix the display bug in Safari and Google Chrome that causes initial content in the lower left pane to be only partially displayed.
Posted on Thu 6 Nov 2008 under general, i18n, utilities, web
>> Use it !

I have just upgraded the Malayalam picker to level 7, and added a bunch of new features that should show up in other pickers at level 7 as I get time:
Shape view The pickers are aimed particularly at people who are not familiar enough with a script to use the keyboard. However, there are many ligatures and conjuncts in Malayalam, which makes it difficult to identify the character sequences needed. This view provides most of the shapes you’ll see in Malayalam text, grouped by shape. It’s something I’ve been wanting to add to the pickers for some time.

Phonic view This has been done in other pickers, but it has some new features over those. The sounds have been arranged along similar lines to a standard IPA chart, and multiple transcriptions are supported. In addition, you can click on the transcription text to build up a phonemic string in IPA. This is particularly useful for creating examples.

Regular expressions in searches The search feature was upgraded to allow for regular expressions. So now you can highlight characters containing GA without highlighting ones containing NGA: just search for \bga\b (or use the convenient short-cut form .ga.). Of course you can do more complicated searches too.
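A rough sketch of how such a search might behave over a list of character names follows. The function names are hypothetical, and the way the .ga. short-cut is expanded into \bga\b is a guess at the behaviour described, not the picker's code.

```javascript
// Assumed expansion of the ".ga." short-cut into "\bga\b".
function expandShortcut(query) {
  const m = /^\.(.+)\.$/.exec(query);
  return m ? '\\b' + m[1] + '\\b' : query;
}

// Return the names that match the query, ignoring case.
function searchNames(names, query) {
  const re = new RegExp(expandShortcut(query), 'i');
  return names.filter(name => re.test(name));
}

const names = [
  'MALAYALAM LETTER GA',
  'MALAYALAM LETTER NGA',
  'MALAYALAM LETTER GHA',
];
console.log(searchNames(names, 'ga'));        // GA and NGA
console.log(searchNames(names, '\\bga\\b'));  // GA only
console.log(searchNames(names, '.ga.'));      // GA only
```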
Add codepoint You can add a hex codepoint value to the box in the yellow area to insert into the text. This is useful for things like the odd unusual character, or for just figuring out what a sequence of codepoints represents. You can input any number of codepoints (including surrogates) into the input box, separating them by spaces.
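Converting such input to text can be sketched as follows. This is an illustration with a hypothetical function name, not the picker's code. It relies on the fact that a high and a low surrogate placed next to each other in a JavaScript string form a single supplementary character.

```javascript
// Convert space-separated hex code points (surrogates allowed) to a string.
function codePointsToText(input) {
  return input.trim().split(/\s+/)
    .filter(Boolean)                   // ignore empty input
    .map(hex => String.fromCodePoint(parseInt(hex, 16)))
    .join('');
}

console.log(codePointsToText('0D15 0D4D 0D37'));            // Malayalam KA + VIRAMA + SSA
console.log(codePointsToText('D835 DC00') === '\u{1D400}'); // true
```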
Chillus This version of the picker supports all Unicode 5.1 characters, including the chillu characters. Because most Malayalam fonts support the old way of inputting chillu forms, you can specify in the yellow box area what you want the output to be when clicking on a chillu letter: the pre-5.1 sequence or the new atomic character. (The default is the atomic character.)
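The choice between the two outputs amounts to a lookup from each atomic chillu to its older sequence of consonant, virama and ZWJ. The sketch below uses the six chillu characters added in Unicode 5.1. The table and function names are hypothetical and the picker's own implementation may differ.

```javascript
// Atomic chillu -> pre-5.1 sequence (consonant + VIRAMA + ZWJ).
const CHILLU_LEGACY = {
  '\u0D7A': '\u0D23\u0D4D\u200D',  // CHILLU NN
  '\u0D7B': '\u0D28\u0D4D\u200D',  // CHILLU N
  '\u0D7C': '\u0D30\u0D4D\u200D',  // CHILLU RR
  '\u0D7D': '\u0D32\u0D4D\u200D',  // CHILLU L
  '\u0D7E': '\u0D33\u0D4D\u200D',  // CHILLU LL
  '\u0D7F': '\u0D15\u0D4D\u200D',  // CHILLU K
};

// Return what a click on a chillu letter should output.
// useAtomic defaults to true, matching the default described above.
function chilluOutput(ch, useAtomic = true) {
  if (useAtomic || !Object.prototype.hasOwnProperty.call(CHILLU_LEGACY, ch)) {
    return ch;
  }
  return CHILLU_LEGACY[ch];
}

console.log(chilluOutput('\u0D7B').length);         // 1: atomic character
console.log(chilluOutput('\u0D7B', false).length);  // 3: pre-5.1 sequence
```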
The picker also comes with the usual set of level 7 features, such as font grid view, graphic characters, hiding of uncommon characters, optimised ordering of characters in the alphabetic view, two-tone highlighting, etc.
You can start up directly in either of the available views by appending the following to your URI: ?view=, followed by one of, respectively, alphabet, shape, phonic or fontgrid.
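Reading such a parameter can be sketched with the standard URL API. The function name, the example address and the fallback to the alphabet view are assumptions made for illustration.

```javascript
const VIEWS = ['alphabet', 'shape', 'phonic', 'fontgrid'];

// Return the view named in the URI, or the fallback if none is valid.
function startupView(uri, fallback = 'alphabet') {
  const view = new URL(uri).searchParams.get('view');
  return VIEWS.includes(view) ? view : fallback;
}

// example.org is a placeholder address, not the picker's location.
console.log(startupView('https://example.org/picker?view=shape'));  // shape
console.log(startupView('https://example.org/picker'));             // alphabet
```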
Enjoy.
Posted on Mon 6 Oct 2008 under general, i18n, web, writings
>> Read it !

I finally got to the point, after many long early morning hours, where I felt I could remove the ‘Draft’ from the heading of my Myanmar (Burmese) script notes.
This page is the result of my explorations into how the Myanmar script is used for the Burmese language in the context of the Unicode Myanmar block. It takes into account the significant changes introduced in Unicode version 5.1 in April of this year.
Btw, if you have JavaScript running you can get a list of characters in the examples by mousing over them. If you don’t have JS, you can link to the same information.
There’s also a PDF version, if you don’t want to install the (free) fonts pointed to for the examples.
Here is a summary of the script:
Myanmar is a tonal language and is syllable-based. The script is an abugida, ie. consonants carry an inherent vowel sound that is overridden using vowel signs.
Spaces are used to separate phrases, rather than words. Words can be separated with ZWSP to allow for easy wrapping of text.
Words are composed of syllables. These start with a consonant or initial vowel. An initial consonant may be followed by a medial consonant, which adds the sound j or w. After the vowel, a syllable may end with a nasalisation of the vowel or an unreleased glottal stop, though these final sounds can be represented by various different consonant symbols.
At the end of a syllable a final consonant usually has an ‘asat’ sign above it, to show that there is no inherent vowel.
In multisyllabic words derived from an Indian language such as Pali, where two consonants occur internally with no intervening vowel, the consonants tend to be stacked vertically, and the asat sign is not used.
Text runs from left to right.
There is a set of Myanmar numerals, which are used just like Latin digits.
So, what next? I’m quite keen to get to Mongolian. That looks really complicated. But I’ve been telling myself for a while that I ought to look at Malayalam or Tamil, so I think I’ll try Malayalam.
Posted on Thu 2 Oct 2008 under general, i18n, utilities, web
>> Use it !

I have just upgraded the Burmese picker as follows:
Rearranged characters The Myanmar3 font expects multiple combining characters to be entered in the order described in the Unicode 5.1 Standard for correct display. The panel of combining characters has been arranged so that you can easily remember what that order is. Characters to the left precede those to the right, characters higher up precede those lower down.
In addition to that, I have rearranged all the character positions so that it is easier to locate the various parts of a syllable as you type.
I also added some combinations of characters that make up multi-part vowels and the kinzi with a single click.
I have also moved some of the less common characters to an ‘advanced’ area to the right which can be opened and closed by clicking on the arrow-head icon.
New highlighting As you mouse over a character the picker will show you other characters that are visually similar (particularly useful for those not very familiar with the script). This new version shows the more likely confusable characters with a blue outline, and other similar characters with orange. This is useful given that many Myanmar characters look quite similar.
As always, you can turn off this feature or disable it in the URI you use to call the picker.
Font grid view Shows characters in Unicode order, using whatever font is specified in the Font list or Custom font input fields. This allows comparison of fonts (especially useful in IE, which shows if a glyph is missing from a font).
You can start up directly in either of the available views by appending the following to your URI: ?view=, followed by one of, respectively, alphabet or fontgrid.
Enjoy.
Posted on Fri 14 Mar 2008 under general, i18n, utilities, web
>> Use it !

This latest picker includes all characters in the Unicode Lao block, plus a few punctuation characters. There are several alternative views.
Alphabetic By default, characters are arranged by groups, and consonants and vowels are listed in alphabetic order. Digits are in keypad order. Similar characters are highlighted by default, but this can be switched off using the ‘Hint’ selector.
Tone marks and combining vowels are reordered automatically so that vowels come first in the output character sequence.
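The reordering can be sketched as a check made each time a character is appended: if a combining vowel arrives directly after a tone mark, the two are swapped. The character ranges below are my reading of the Lao block, the function names are hypothetical, and the picker's real logic may handle more cases.

```javascript
// Lao tone marks U+0EC8..U+0ECB.
const isLaoTone = ch => ch >= '\u0EC8' && ch <= '\u0ECB';

// Lao combining vowels above or below the consonant.
const isLaoCombiningVowel = ch =>
  ch === '\u0EB1' || (ch >= '\u0EB4' && ch <= '\u0EB9') || ch === '\u0EBB';

// Append ch to text, keeping combining vowels ahead of tone marks.
function appendLao(text, ch) {
  const last = text.slice(-1);
  if (isLaoCombiningVowel(ch) && isLaoTone(last)) {
    return text.slice(0, -1) + ch + last;  // vowel first, then tone
  }
  return text + ch;
}

// KO, then tone mark MAI THO, then vowel sign I.
let out = '';
for (const ch of ['\u0E81', '\u0EC9', '\u0EB4']) out = appendLao(out, ch);
console.log(out === '\u0E81\u0EB4\u0EC9');  // true
```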
Phonic Characters are grouped and ordered by sound. I set this up for myself to enter Lao text that I wanted to copy that was accompanied by a transcription. Initial consonants are followed by tones and consonants that come second in a cluster, then vowels. Alternatives with the same sound are separated by a red dot. Consonants that have different sounds when word final are also listed under those sounds. (Dropped aspiration is not considered significant.)
Dashes representing consonants indicate which vowels are non-final or occur before the consonant. Where a vowel has a part that comes before a consonant, a single click should arrange the parts properly. This behaviour speeds up typing. It may not be so intuitive to people familiar with Lao, however, since it makes Lao behave like Khmer and Indic scripts.
You should add any tone mark before the vowel and the picker will automatically reorder characters as needed. If you want to wrap text around a combination of two syllable-initial characters, type the characters then click on ‘flag as cluster’ before clicking on the tone mark or vowel.
Two old vowel spellings are only displayed if you click on the grey arrow, top right.
Font grid Shows characters in Unicode order, using whatever font is specified in the Font list or Custom font input fields. This allows comparison of fonts (especially useful in IE, which shows if a glyph is missing from a font).
You can start up directly in one of the above views by appending the following to your URI: ?view=, followed by one of, respectively, alphabet, phonic or fontgrid.
Enjoy.
Posted on Wed 27 Feb 2008 under general, i18n, utilities
>> Use it !

The default arrangement for this picker is still shape-based (though with some small improvements), but I have added a new view that is arranged by sound.
Update: After some initial feedback, I decided to change the phonic view of the picker so that vowels are entered by single click. This will probably disconcert people familiar with typing Thai. Revised description follows.
Another update (2008-03-03): I have added additional ways of viewing the characters, and re-architected the picker as a basis for extending this to other pickers in the future. I also changed the way of dealing with initial clusters in the phonic view. I changed the text below again to reflect what’s new:
Alphabetic view By default, characters are arranged by groups, and consonants and vowels are listed in alphabetic order. Digits are in keypad order. Obsolete and rare characters are only displayed if you click on the grey arrow, top right. Similar characters are highlighted by default, but this can be switched off using the ‘Hint’ selector.
Comparison view This was the original view for the Thai picker. Characters are grouped by shape or type to enable easy identification by people who are unfamiliar with the Thai script. Vowels are shown near the bottom. Digits are on the right, in keypad order.
Phonic view Characters are grouped and ordered by sound. I set this up for myself, because I wanted to enter Thai text that was accompanied by a transcription.
Initial consonants are followed by tones and consonants that come second in a cluster, then vowels. Alternatives with the same sound are separated by a red dot. Consonants that have different sounds when word final are also listed under those sounds. (Dropped aspiration is not considered significant.)
Dashes representing consonants indicate which vowels are non-final or occur before the consonant.
Where a vowel has a part that comes before a consonant, a single click should arrange the parts properly. This behaviour speeds up typing. It may not be so intuitive to people familiar with Thai, however, since it makes Thai behave like Khmer and Indic scripts. You should add any tone mark before the vowel and the picker will automatically reorder characters as needed.
If you want to wrap text around a combination of two syllable-initial characters, type the characters then click on ‘flag as cluster’ before clicking on the tone mark or vowel.
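The single-click behaviour for vowels that wrap around a consonant can be sketched like this: the part that precedes the consonant is inserted before the last consonant (or before the flagged cluster), and the remaining part is appended. Function and parameter names are hypothetical, the consonant range is my reading of the Thai block, and clusterSize stands in for the ‘flag as cluster’ control.

```javascript
// Thai consonants KO KAI .. HO NOKHUK.
function isThaiConsonant(ch) {
  const cp = ch.codePointAt(0);
  return cp >= 0x0E01 && cp <= 0x0E2E;
}

// Wrap a vowel around the last consonant (or consonant cluster) in text.
// pre goes before the consonant, post after everything typed so far.
function applyWrappingVowel(text, pre, post, clusterSize = 1) {
  const chars = Array.from(text);
  let i = chars.length - 1;
  while (i >= 0 && !isThaiConsonant(chars[i])) i--;  // skip tone marks etc.
  if (i < 0) return text + pre + post;               // no consonant found
  const start = Math.max(0, i - (clusterSize - 1));
  chars.splice(start, 0, pre);
  return chars.join('') + post;
}

// KO KAI + MAI THO, then the vowel written SARA E ... SARA AA.
const word = applyWrappingVowel('\u0E01\u0E49', '\u0E40', '\u0E32');
console.log(word === '\u0E40\u0E01\u0E49\u0E32');  // true
```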
Font grid view Shows characters in Unicode order, using whatever font is specified in the Font list or Custom font input fields. This allows comparison of fonts (especially useful in IE, which shows if a glyph is missing from a font).
You can start up directly in any one of the above views by appending the following to your URI: ?view=, followed by one of, respectively, alphabet, comparison, phonic or fontgrid.
Enjoy.
Posted on Thu 21 Feb 2008 under general, i18n, utilities, web
>> Use it !

This latest picker includes characters used for writing Vietnamese. Characters are taken from various Latin Unicode blocks.
Tones are separated from base characters in the selection area, but the output you create is always fully precomposed. If you copy and paste text into the output area, you can normalize the Vietnamese text as NFC by selecting the tab below. The Vietnamese text in the output area is also normalized when you select one of the transcription tabs.
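The effect of keeping tones separate in the selection area while producing precomposed output can be sketched as follows. The built-in normalize method is used here as a stand-in, which is an assumption, and the function name is hypothetical.

```javascript
const ACUTE = '\u0301';  // COMBINING ACUTE ACCENT, one of the tone marks

// Add a combining tone mark to a base letter and compose the result.
function addTone(base, toneMark) {
  return (base + toneMark).normalize('NFC');
}

const result = addTone('\u00EA', ACUTE);  // e with circumflex, plus acute
console.log(result === '\u1EBF');         // true: one precomposed character
console.log(result.length);               // 1
```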
The IPA N and IPA S tabs provide a basic, mostly phonemic-level, transcription of the pronunciation. N means North Vietnamese, S is for South. The sources I used for this varied a great deal, particularly in the choice of symbols to represent vowels. There are also more than two main dialects. So this is a synthesis and a rough guide. Some rare vowel combinations may be missing, although I have covered quite a number.
There are a large number of UVN fonts – so many that I didn’t know which ones to pick for the font pulldown. I chose the two that show up on Alan Wood’s page. If you think certain others are so common that they ought to be there, please let me know.
Enjoy.
Posted on Fri 4 Jan 2008 under general, i18n, utilities, web
>> Use it !

Although I have a picker already for Arabic, Persian and Urdu, I have developed another that is specifically for inputting Urdu. One reason for this is to reduce the choice of characters so that the user is more likely to select the right character for Urdu (eg. heh goal rather than arabic heh). Another is to provide shortcuts for things like aspirated letters and some common combinations (like the word ‘allah’).
It includes characters used for Urdu in Unicode 5.0. Most of the characters in the Urdu standard UZT 1.01 are included.
The aspirated letters of the alphabet can be entered with a single click. Also, base characters with diacritics can be inserted into the text with a single click where NFC normalisation would produce a single precomposed character.
Letters of the alphabet are shown in alphabetic order at the top left, digits are in keypad order, and combining characters related to vowel sounds are shown along the bottom. The lower middle section contains useful but non-alphabetic characters and punctuation. To the right are various symbols. Hinting is implemented for visually similar glyphs.
Posted on Fri 4 Jan 2008 under general, i18n, utilities, web
>> Use it !

Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes a picker much more usable than a regular character map utility.
The Bengali picker includes all the characters in the Unicode 5.0 Bengali block. Note: There was an important addition to the Bengali block in version 4.1, a single character for khanda ta, that may not yet be supported in fonts, but has been added to this version of the picker.
Consonants are mostly in a typical articulatory arrangement, vowels are aligned with vowel signs, and digits are in keypad order. Hinting is implemented for visually similar glyphs.
A function has also been added to transliterate Bengali text to Latin, though the scheme used is not standard, and may change at short notice. Don’t use it in anger yet.
Posted on Fri 4 Jan 2008 under general, i18n, utilities, web
I’ve been wanting to improve the editing behaviour of my pickers for quite some time, so that users could interact more easily with the keyboard, and insert characters into the middle of a composition, not just at the end. In fact, the output area maintains the focus all the time, now – which makes a major improvement to the usability of the pickers.
This week I made those things happen, and created a new template with some other changes, too.
An updated Bengali picker is first out of the box, but look out for a brand new Urdu-specific picker to follow close on its heels. I will retrofit the new template to other pickers as time allows, or need dictates.
I also beefed up the font selection list with a large number of TT and OT fonts, and improved the reference material at the bottom.
I improved the mechanism that highlights similar characters, to give more fine-grained control to the associations between characters.
I also added a field just under the title that gives information about the character the user is mousing over, and added a search field to help users find characters for which they know the Unicode name or number. I plan to extend the information associated with characters in future to include native names (eg. e-kar) and other useful search info.
I also changed the scripting and HTML so that a single click can now produce multiple characters in the composition field. This will allow users to input ligatures like the Indic ‘ksha’ or Urdu aspirated consonants, or complex sequences tied to ligatures (like the word ‘Allah’) with a simple click.
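In outline, this means that each clickable item maps to a string of one or more characters rather than to a single character. A minimal sketch follows, with a made-up table and function name. The two sequences shown are the Urdu aspirated ‘bh’ and the Bengali ‘ksha’ conjunct.

```javascript
// Hypothetical table: clickable item -> characters to insert.
const SEQUENCES = {
  bh: '\u0628\u06BE',          // Urdu: BEH + HEH DOACHASHMEE
  ksha: '\u0995\u09CD\u09B7',  // Bengali: KA + VIRAMA + SSA
};

// Append whatever the clicked item stands for to the composition text.
function clickItem(composition, item) {
  const chars = Object.prototype.hasOwnProperty.call(SEQUENCES, item)
    ? SEQUENCES[item]
    : item;  // plain items insert themselves
  return composition + chars;
}

console.log(clickItem('', 'ksha').length);  // 3 characters from one click
console.log(clickItem('', 'bh').length);    // 2 characters from one click
```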
Some things have also been removed. There is no DEL button now, since you can interact more easily with the keyboard for that. Spaces are available from the (now rationalised) character area, rather than a button. And there is no longer an option to switch between graphics and characters for the selection. This is partly for simplicity, and partly to make it easier to represent some of the slightly more complicated selection options I want to add in future – for example, specific shapes are appropriate for Urdu arabic characters, and I don’t want to leave it to chance as to whether the user’s system has the right fonts to produce the desired shapes.
Getting to this actually required a huge amount of unseen work, since I had to wrap all the images in button markup and move and change attributes, etc. so that the composition box retains the focus in IE (it worked fine for Firefox, Opera and Safari). I also, of course, made significant, but probably not noticeable, changes to the JavaScript and CSS.
Posted on Mon 19 Nov 2007 under general, i18n, web, writings

The word Mandalay in Myanmar script.
I’ve been brushing up on the Myanmar script, since major changes are on the way with Unicode 5.1.
I upgraded my Myanmar picker to handle the new characters, and I edited my notes on how the script works.
The new characters will make a big difference to how you author text in Unicode, and people will need to update currently existing pages to bring them in line with the new approach. The changes should make it much easier to create content in Burmese, in addition to addressing some niggly problems with making the script work correctly. One reason the changes were sanctioned is that there is currently very little Burmese content out there in Unicode.
I’ll be updating my character by character notes later too.
The only problem with all this is that existing fonts will all need to be changed to support the new world order (or myanmar order). I found one font that is already 5.1 ready from the Myanmar Unicode & NLP Research Center. So if you don’t want to download that font, you’ll need to read the PDF version of my notes on the script.
That would be a pity, however, since I had some fun adding JavaScript to the article today, so that it displays a breakdown, character by character, of each example as you mouse over it (using images, so you see it properly).
Posted on Mon 6 Nov 2006 under general, i18n, utilities, web
New picker
I finally got around to studying the Tibetan script. To help with that I created a Tibetan picker.
This picker includes all the characters in the Unicode Tibetan block.
The default shows all characters as images due to the rarity of Tibetan fonts. Consonants are mostly in a typical articulatory arrangement, with vowels below, and digits in keypad order.
Since characters cover the whole Tibetan block, there are many characters that are used for transcriptions rather than just the characters needed for ordinary Tibetan text. There are also many symbols, and three characters that are not in the Tibetan block itself. I tried to arrange things so that the most commonly used characters for Tibetan or Dzongkha are easy to get at, but I’m open to suggestions.
Note that the Tibetan Machine Uni font I use as a default setting is an OpenType font that requires version 1.453.3665.0 or later of the Uniscribe engine (usp10.dll). So the output is not ideal in my browser. Works fine if you cut and paste into MS Word though.
Enjoy.
Update:
I installed a later version of uniscribe, and now my Tibetan text looks fine in the browser as well as in Office. On my previous laptop I just used a small tool that’s downloadable from the Tibetan & Himalayan Digital Library. My new laptop, however, didn’t work with that tool – I’ve no idea why. So I had to resort to using the Windows Recovery Console.
I’m already subscribed to Microsoft Volt, so I used the latest Uniscribe version from there, dated 4 Jan 2006.
Posted on Fri 30 Jun 2006 under general, utilities
New picker
Someone who uses the pickers for cataloguing the Asian language collection in a UK library asked me to provide a Gujarati picker. Since he was suitably flattering about the other pickers, I thought I ought to oblige.
This picker includes all the characters in the Unicode Gujarati block.
The default shows all characters as images due to the rarity of Gujarati fonts. Consonants are mostly in a typical articulatory arrangement, vowels are aligned with vowel signs, and digits are in keypad order. I have not implemented any highlighting of similar characters, since I put this together very quickly.
Enjoy.
Posted on Thu 29 Jun 2006 under general, i18n, utilities
Updated picker
Thanks to Gernot Katzer, I realised that during the recent styling update for the Hebrew picker I missed out a whole bunch of combining marks.
It’s now fixed. Thanks Gernot!
Posted on Wed 5 Apr 2006 under general, i18n, utilities, web
New picker
This new picker includes all the characters you usually need for Arabic, Persian and Urdu. It is a subset of the existing Arabic Block picker, which has many more characters to choose from that only get in the way most of the time when dealing with the aforementioned languages.
The default shows all characters as images. Characters are arranged so that similar looking characters are together.
I also added clickable images for things like RLM, LRM, ZWNJ, etc. These will be making appearances throughout the pickers as I roll out the new look and feel.
Posted on Sun 26 Mar 2006 under general, i18n, utilities, web
New picker
This new picker includes all the characters in the Unicode Myanmar block.
The default shows all characters as images due to the rarity of Myanmar fonts. Consonants are mostly in a typical articulatory arrangement, vowels are aligned with vowel signs, and digits are in keypad order. Hinting is implemented for visually similar glyphs.
I don’t know a lot about Myanmar yet, so any suggestions for improving the layout are welcome. (Noting that this is supposed to help recognition of characters by people who are new to the script.)
Posted on Fri 17 Mar 2006 under general, i18n, utilities, web
New version.
Just produced version 2.0 with the following changes:
- Added some glyphs that were missing by comparing with the information at the IPA home page.
- Added image/character switch, but changed it so that combining characters are only ever shown as images.
- Combining graphics are now distinguished by a pale blue background.
- Added some links and information to the bottom of the page.
The French version has not been updated.
Posted on Sat 20 Aug 2005 under general, utilities
New app
This picker includes all characters in the Unicode 4.0 Ethiopic block. It does not cover additions in Unicode version 4.1.
(Before creating a full Ethiopic picker, I need to get a font that covers all the new characters, and I need to figure out how to best fit the new characters into the current arrangement.)
Letters are arranged in the Unicode code page order, which is aligned with the traditional consonant-vowel matrix. I didn’t actually use a table, to save space. If you have strong views on the layout, send me some suggestions.
Posted on Fri 19 Aug 2005 under general, utilities
New app
Includes all characters in the Unicode Armenian block. Letters are arranged in the Unicode code page order, but upper and lower case letters are side by side.