I’m sitting here watching a video of Timbl talking on a BBC news page, and I’ve suddenly realised how good this is.

The page design helps give the impression - there are no clunky boxes around the video itself - but there’s also no need to view in a different area, or switch to another tool, or even wait for a download to get started - it’s just there as part of the page, but a part that moves and produces sound. Kind of like the moving paper in Harry Potter’s world.

It’s great how technology marches on sometimes.

[Update: Since I wrote the above, the video has acquired grey panels around the edges for controls, which I think is a shame. It's still pretty good technology though.]

>> See what it can do !

>> Use it !

Picture of the page in action.

Those of you who have used UniView over the last couple of days will have seen that it now supports Unicode 5.1. All Unicode 5.1 character information is available; however, you will only be able to see the new characters if you have fonts that cover them. The decodeunicode graphics for the new characters are not yet available.

Last night I also fixed a long-running bug which meant that additional information available in my character database was not accessible in Internet Explorer (due to AJAX issues). (See the related post if you are interested in the code.)

There are no other changes at this time (though those two are pretty significant).

Please report any bugs to me, and don’t forget to refresh any UniView files in your cache before using the new version.

Some code I put together to import some XML retrieved via AJAX into a document (stored here so I can find it again in the future).

IE won’t let you import a cloned nodeset into a document, so I wrote this for my UniView utility. The code starts with a node in the AJAX data and recreates all of its descendant elements, attributes and text nodes in the current document.

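// Recursively copy the children of ajaxnode (a node in the AJAX response)
// into copiednode (a node that belongs to the current document).
function copyNodes (ajaxnode, copiednode) {
	for (var node=ajaxnode.firstChild; node != null; node = node.nextSibling) {
		if (node.nodeType == 3){ // text node: recreate it in the current document
			copiednode.appendChild(document.createTextNode(node.data));
			}
		else if (node.nodeType == 1){ // element: recreate it, then its attributes
			var subnode = document.createElement(node.nodeName);
			var attlist = node.attributes;
			if (attlist != null) {
				for (var i=0; i<attlist.length; i++){
					subnode.setAttribute(attlist[i].name, attlist[i].value);
					}
				}
			copiednode.appendChild(subnode);
			copyNodes(node, subnode); // recurse into the element's children
			}
		// any other node type (comment, processing instruction, etc.) is skipped
		}
	}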

It doesn’t expect processing instructions, comments, etc., just elements, attributes and text. (Though of course that can be added, if needed.)
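As a rough idea of how comments and CDATA sections might be added, here is a minimal sketch. The name copyNodesExtended and the extra targetDoc parameter are assumptions made for illustration (passing the document in explicitly just makes the function easier to try out); none of this is taken from UniView itself.

```javascript
// Sketch only: a variant of the copy routine that also carries over
// comments and CDATA sections. targetDoc is the document that will own
// the copies (a hypothetical parameter, not in the original function).
function copyNodesExtended(sourceNode, targetNode, targetDoc) {
  for (let node = sourceNode.firstChild; node !== null; node = node.nextSibling) {
    switch (node.nodeType) {
      case 1: { // element: recreate it with its attributes, then recurse
        const copy = targetDoc.createElement(node.nodeName);
        const attrs = node.attributes;
        if (attrs) {
          for (let i = 0; i < attrs.length; i++) {
            copy.setAttribute(attrs[i].name, attrs[i].value);
          }
        }
        targetNode.appendChild(copy);
        copyNodesExtended(node, copy, targetDoc);
        break;
      }
      case 3: // text
      case 4: // CDATA section, copied here as ordinary text
        targetNode.appendChild(targetDoc.createTextNode(node.data));
        break;
      case 8: // comment
        targetNode.appendChild(targetDoc.createComment(node.data));
        break;
      default: // processing instructions and anything else are still skipped
        break;
    }
  }
}
```

In a browser it would be called with something like copyNodesExtended(ajaxNode, container, document).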

Picture of typical links section.

W3C Internationalization articles have links in the top right corner to translated versions of the page. When a new translation is provided, these links need updating on each translated version of the article in question. This has been a pain to do.

I just published details of a new approach to managing these changes which means that I no longer have to touch the files themselves, and can produce the changes with a single, very small edit.

I’m not claiming that this is the ideal solution (though so far it seems pretty helpful, and way better than the previous approach) - just documenting it for those who are interested.

>> Use it !

Picture of the page in action.

This latest picker includes all characters in the Unicode Lao block, plus a few punctuation characters. There are several alternative views.

Alphabetic By default, characters are arranged by groups, and consonants and vowels are listed in alphabetic order. Digits are in keypad order. Similar characters are highlighted by default, but this can be switched off using the ‘Hint’ selector.

Tone marks and combining vowels are reordered automatically so that vowels come first in the output character sequence.
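To give a feel for what that reordering involves, here is a minimal sketch, not the picker's actual code. It assumes only the four Lao tone marks (U+0EC8 to U+0ECB) and the combining vowel signs U+0EB1, U+0EB4 to U+0EB9 and U+0EBB are involved, and the function name vowelsFirst is made up for the example.

```javascript
// Sketch only: swap any "tone mark + combining vowel" pair so that the
// vowel comes first. The character sets are an assumption for illustration.
const LAO_VOWELS = "\u0EB1\u0EB4-\u0EB9\u0EBB"; // mai kan, i..uu, mai kon
const LAO_TONES = "\u0EC8-\u0ECB";              // mai ek, mai tho, mai ti, mai catawa
const TONE_BEFORE_VOWEL = new RegExp("([" + LAO_TONES + "])([" + LAO_VOWELS + "])", "g");

function vowelsFirst(text) {
  return text.replace(TONE_BEFORE_VOWEL, "$2$1");
}

// KO + MAI THO + VOWEL SIGN I is rewritten as KO + VOWEL SIGN I + MAI THO.
const reordered = vowelsFirst("\u0E81\u0EC9\u0EB4");
```

Text that is already in vowel-then-tone order passes through unchanged.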

Phonic Characters are grouped and ordered by sound. I set this up for myself to enter Lao text that I wanted to copy that was accompanied by a transcription. Initial consonants are followed by tones and consonants that come second in a cluster, then vowels. Alternatives with the same sound are separated by a red dot. Consonants that have different sounds when word final are also listed under those sounds. (Dropped aspiration is not considered significant.)

Dashes representing consonants indicate which vowels are non-final or occur before the consonant. Where a vowel has a part that comes before a consonant, a single click should arrange the parts properly. This behaviour speeds up typing. It may not be so intuitive to people familiar with Lao, however, since it makes Lao behave like Khmer and Indic scripts.

You should add any tone mark before the vowel and the picker will automatically reorder characters as needed. If you want to wrap text around a combination of two syllable-initial characters, type the characters then click on ‘flag as cluster’ before clicking on the tone mark or vowel.

Two old vowel spellings are only displayed if you click on the grey arrow, top right.

Font grid Shows characters in Unicode order, using whatever font is specified in the Font list or Custom font input fields. This allows comparison of fonts (especially useful in IE, which shows if a glyph is missing from a font).

You can start up directly in one of the above views by appending the following to your URI: ?view=, followed by one of, respectively, alphabet, phonic or fontgrid.
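For anyone curious how a page might read such a parameter, here is a small sketch. It is an illustration only: the function name startupView, the example.org address and the choice of alphabet as the fallback are all assumptions, not details of the picker.

```javascript
// Sketch only: pick the startup view from the ?view= parameter, falling
// back to a default when the parameter is missing or not recognised.
function startupView(uri, allowedViews, defaultView) {
  const requested = new URL(uri).searchParams.get("view");
  return allowedViews.includes(requested) ? requested : defaultView;
}

const views = ["alphabet", "phonic", "fontgrid"];
// Hypothetical address, used here only to show the shape of the URI.
const chosen = startupView("https://example.org/pickers/lao/?view=phonic", views, "alphabet");
```

With the address shown, chosen is "phonic"; an address with no view parameter, or with an unknown value, yields the fallback.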

Enjoy.

>> See what it can do !

>> Use it !

Picture of the page in action.

While we await Unicode 5.1, here is another update to UniView that provides a bunch of additional useful features and fixes a few bugs.

Changes include:

  • Changed the custom range input to a single field that will accept various range formats. This makes it easier to cut and paste or drag and drop ranges into the input field.
  • The numbers in the Custom range field must be in hexadecimal form and separated by a colon (the default), a hyphen, one or more spaces, or one or more periods. There must be exactly two numbers. The numbers can be in the following formats: 1234, &#x1234;, &#1234;, \u1234, U+1234. The actual number of hex digits can be between 1 and 6.
  • Added the ability to select whether Search looks at any combination of character names only, other parts of a record in the Unicode database, or the other character description information, and added a message to say how many characters were matched.
  • Added the ability to search within the range specified in the field entitled Range.
  • Added the ability to list characters with a given General or Bidirectional property (within a specified range or not).
  • Added an AJAX link to my database of information about Unicode characters. If enabled, using the DB checkbox, this automatically retrieves any available data for a character when information about that character is displayed in the lower right panel. You can also specify that UniView should open with that set as the default using database=on in the URI used to call UniView.
  • Because of the previous improvement, I removed the ability to link in a file of information about characters. (The information in the files was a copy of the information in the database.)
  • Moved the Code point(s) and Cut & paste fields lower, to make them easier to use.
  • Fixed a bug that was preventing the Search function finding characters in the Basic Latin block.
  • Bugfix: a range like 0036:0067 will always show full rows now; a range whose start is higher than its end will show an alert.
  • Added a reference to decodeunicode when graphics are displayed in the left column.
  • Bugfix: the search parameter won’t break when graphics etc. are toggled.
  • You can now specify a windowHeight parameter at startup in the URI’s query string.
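
The range rules in the list above could be sketched roughly as follows. This is not UniView's code: the function names are invented, every number is read as hexadecimal as the text says (including the &#1234; form, which HTML itself would treat as decimal), and "full rows" is assumed to mean rows of 16 code points.

```javascript
// Sketch only: parse a custom range such as "0036:0067" or "U+1234-U+1240".
// Accepted wrappers around the hex digits: &#x…;  &#…;  \u…  U+…  or none.
const CODE_POINT = /^(?:&#x?|\\u|U\+)?([0-9a-f]{1,6});?$/i;

function parseCodePoint(token) {
  const match = CODE_POINT.exec(token);
  return match ? parseInt(match[1], 16) : null;
}

function parseRange(input) {
  // Separators: a colon, a hyphen, one or more spaces, or one or more periods.
  const tokens = input.trim().split(/\s*[:\-]\s*|\s+|\.+/).filter((t) => t !== "");
  if (tokens.length !== 2) {
    throw new Error("A range needs exactly two numbers");
  }
  const start = parseCodePoint(tokens[0]);
  const end = parseCodePoint(tokens[1]);
  if (start === null || end === null || end > 0x10FFFF) {
    throw new Error("Unrecognised code point in range");
  }
  if (start > end) {
    throw new Error("Range start is higher than range end");
  }
  // Widen to whole rows of 16 (an assumption about what a "full row" is).
  return { start: start & ~0xF, end: end | 0xF };
}

const fullRows = parseRange("0036:0067");
```

Here fullRows.start is 0x30 and fullRows.end is 0x6F, so the display would begin and end on row boundaries. A real page would show an alert where this sketch throws.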

Please report any bugs to me, and don’t forget to refresh any UniView files in your cache before using the new version.

>> Use it !

Picture of the page in action.

The default arrangement for this picker is still shape-based (though with some small improvements), but I have added a new view that is arranged by sound.

Update: After some initial feedback, I decided to change the phonic view of the picker so that vowels are entered by single click. This will probably disconcert people familiar with typing Thai. Revised description follows.

Another update (2008-03-03): I have added additional ways of viewing the characters, and re-architected the picker as a basis for extending this to other pickers in the future. I also changed the way of dealing with initial clusters in the phonic view. I changed the text below again to reflect what’s new:

Alphabetic view By default, characters are arranged by groups, and consonants and vowels are listed in alphabetic order. Digits are in keypad order. Obsolete and rare characters are only displayed if you click on the grey arrow, top right. Similar characters are highlighted by default, but this can be switched off using the ‘Hint’ selector.

Comparison view This was the original view for the Thai picker. Characters are grouped by shape or type to enable easy identification by people who are unfamiliar with the Thai script. Vowels are shown near the bottom. Digits are on the right, in keypad order.

Phonic view Characters are grouped and ordered by sound. I set this up for myself, because I wanted to enter Thai text that was accompanied by a transcription.

Initial consonants are followed by tones and consonants that come second in a cluster, then vowels. Alternatives with the same sound are separated by a red dot. Consonants that have different sounds when word final are also listed under those sounds. (Dropped aspiration is not considered significant.)

Dashes representing consonants indicate which vowels are non-final or occur before the consonant.

Where a vowel has a part that comes before a consonant, a single click should arrange the parts properly. This behaviour speeds up typing. It may not be so intuitive to people familiar with Thai, however, since it makes Thai behave like Khmer and Indic scripts. You should add any tone mark before the vowel and the picker will automatically reorder characters as needed.

If you want to wrap text around a combination of two syllable-initial characters, type the characters then click on ‘flag as cluster’ before clicking on the tone mark or vowel.

Font grid view Shows characters in Unicode order, using whatever font is specified in the Font list or Custom font input fields. This allows comparison of fonts (especially useful in IE, which shows if a glyph is missing from a font).

You can start up directly in any one of the above views by appending the following to your URI: ?view=, followed by one of, respectively, alphabet, comparison, phonic or fontgrid.

Enjoy.

>> Use it !

Picture of the page in action.

This latest picker includes characters used for writing Vietnamese. Characters are taken from various Latin Unicode blocks.

Tones are separated from base characters in the selection area, but the output you create is always fully precomposed. If you copy and paste text into the output area, you can normalize the Vietnamese text as NFC by selecting the tab below. The Vietnamese text in the output area is also normalized when you select one of the transcription tabs.
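As an illustration of what normalizing to NFC does to Vietnamese text, here is a small sketch using the standard String.prototype.normalize method. The helper name toPrecomposed is made up for the example, and this is not necessarily how the picker does it.

```javascript
// Sketch only: turn a base letter followed by combining marks into the
// fully precomposed form.
function toPrecomposed(text) {
  return text.normalize("NFC");
}

// "Tie" + COMBINING CIRCUMFLEX + COMBINING ACUTE + "ng"
const typed = "Tie\u0302\u0301ng";
const composed = toPrecomposed(typed); // the e and its two marks become U+1EBF
```

The typed string has seven code units and the composed one five, because e plus its two combining marks collapse into the single character U+1EBF.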

The IPA N and IPA S tabs provide a basic, mostly phonemic-level, transcription of the pronunciation. N means North Vietnamese, S is for South. The sources I used for this varied a great deal, particularly in the choice of symbols to represent vowels. There are also more than two main dialects. So this is a synthesis and a rough guide. Some rare vowel combinations may be missing, although I have covered quite a number.

There are a large number of UVN fonts - so many that I didn’t know which ones to pick for the font pulldown. I chose the two that show up on Alan Wood’s page. If you think certain others are so common that they ought to be there, please let me know.

Enjoy.

This post is about the dangers of tying a specification, protocol or application to a specific version of Unicode.

For example, I was in a discussion last week about XML, and the problems caused by the fact that XML 1.0 is currently tied to a specific version of Unicode, and a very old version at that (2.0). This affects what characters you can use for things such as element and attribute names, enumerated lists for attribute values, and ids. Note that I’m not talking about the content, just those names.

I spoke about this at a W3C Technical Plenary some time back in terms of how this bars people from using certain aspects of XML applications in their own language if they use scripts that have been added to Unicode since version 2.0. This includes over 150 million people speaking languages written with Ethiopic, Canadian Syllabics, Khmer, Sinhala, Mongolian, Yi, Philippine, New Tai Lue, Buginese, Cherokee, Syloti Nagri, N’Ko, Tifinagh and other scripts.

This means, for example, that if your language is written with one of these scripts, and you write some XHTML that you want to be valid (so you can use it with AJAX or XSLT, etc.), you can’t use the same language for an id attribute value as for the content of your page. (Try validating this page now. The previous link used some Ethiopic for the name and id attribute values.)

But there’s another issue that hasn’t received so much press - and yet I think, in its own way, it can be just as problematic. Scripts that were supported by Unicode 2.0 have not stood still, and additional characters are being added to such scripts with every new Unicode release. In some cases these characters will see very general use. Take for example, the Bengali character U+09CE BENGALI LETTER KHANDA TA.

With the release of Unicode 4.1 this character was added to the standard, with a clear admonition that it should in future be used in text, rather than the workaround people had been using previously.

This is not a rarely used character. It is a common part of the alphabet. Put Bengali in a link and you’re generally ok. Include a khanda ta letter in it, though, and you’re in trouble. It’s as if English speakers could use any word in an id, as long as it didn’t have a ‘q’ in it. It’s a recipe for confusion and frustration.

Similar, but much more far reaching, changes will be introduced to the Myanmar script (used for Burmese) in the upcoming version 5.1. Unlike the khanda ta, these changes will affect almost every word. So if your application or protocol froze its Unicode support to a version between 3.0 and 5.0, like IDNA, you will suddenly be disenfranchising Burmese users who had been perfectly happy until now.

Here are a few more examples (provided by Ken Whistler) of characters added to Unicode after the initial script adoption that will raise eyebrows for people who speak the relevant language:

  • 01F9 LATIN SMALL LETTER N WITH GRAVE: shows up in NFC pinyin data for Chinese.
  • 0219 LATIN SMALL LETTER S WITH COMMA BELOW: Romanian data.
  • 0450 CYRILLIC SMALL LETTER IE WITH GRAVE: Macedonian in NFC.
  • 0653..0655 Arabic combining maddah and hamza: Implicated in NFC normalization of common Arabic letters now.
  • 0972 DEVANAGARI LETTER CANDRA A: Marathi.
  • 097B DEVANAGARI LETTER GGA: Sindhi.
  • 0B35 ORIYA LETTER VA: Oriya.
  • 0BB6 TAMIL LETTER SHA: Needed to spell sri.
  • 0D7A..0D7F Malayalam chillu letters: Those will be ubiquitous in Malayalam data, post Unicode 5.1.
  • and a bunch of Chinese additions.

So the moral is this: decouple your application, protocol or specification from a specific version of the Unicode Standard. Allow new characters to be used by people as they come along, and users all around the world will thank you.
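The difference can be sketched in a few lines of code. Everything here is illustrative: the table is a simplified list of Bengali letters as they stood before khanda ta was added, and the function names and the rule used by the open check (reject ASCII punctuation and control characters, accept the rest) are assumptions for the example, not the rules of any particular specification.

```javascript
// Sketch only. A name checker frozen to an old repertoire versus one that
// excludes only what is known to be a problem.
const FROZEN_BENGALI_LETTERS = [
  [0x0985, 0x098C], [0x098F, 0x0990], [0x0993, 0x09A8],
  [0x09AA, 0x09B0], [0x09B2, 0x09B2], [0x09B6, 0x09B9]
];

// Frozen approach: every character must appear in the hard-coded table.
function frozenAccepts(name) {
  return name.length > 0 && Array.from(name).every((ch) => {
    const cp = ch.codePointAt(0);
    return FROZEN_BENGALI_LETTERS.some(([lo, hi]) => cp >= lo && cp <= hi);
  });
}

// Open approach: rule out ASCII punctuation, spaces and control characters,
// and let everything else through, including characters encoded later.
function openAccepts(name) {
  return name.length > 0 && Array.from(name).every((ch) => {
    const cp = ch.codePointAt(0);
    if (cp < 0x80) return /[A-Za-z0-9_-]/.test(ch);
    return cp > 0x9F; // skip the C1 control range
  });
}

const withKhandaTa = "\u0989\u09CE\u09B8"; // a Bengali word containing U+09CE
const frozenResult = frozenAccepts(withKhandaTa);
const openResult = openAccepts(withKhandaTa);
```

The frozen checker rejects the word because of the one newer letter, while the open checker accepts it without needing an update.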

This came up again recently in a discussion on the W3C i18n Interest Group list, and I decided to put my thoughts in this post so that I can point people to them easily.

I think HTML4 and HTML5 should continue to support <b> and <i> tags, for backwards compatibility, but we should urge caution regarding their use and strongly encourage people to use <em> and <strong> or elements with class="…" where appropriate. (I reworded this 2008-02-01)

Here are a couple of reasons I say that:

  1. I constantly see people misusing these tags in ways that can make localization of content difficult.

    For example, just because an English document may use italicisation for emphasis, document titles and foreign words, it doesn’t hold that a Japanese translation of the document will use a single presentational convention for all three. Japanese authors may avoid both italicisation and bolding, since their characters are too complicated to look good in small sizes with these effects. Japanese translators may find that the content communicates better if they use wakiten (boten marks) for emphasis, but corner brackets for 『 document names 』, and guillemets for 《 foreign words 》. These are common Japanese typographic approaches that we don’t use in English.

    The problem is that, if the English author has used <i> tags everywhere (thinking about the presentational rendering he/she wants in English), the Japanese localizer will be unable to easily apply different styling to the different types of text.

    The problem could be avoided if semantic markup is used. If the English author had used <em>..</em> and <span class="doctitle">...</span> and <span class="foreignword">..</span> to distinguish the three cases, it would allow the localizer to easily change the CSS to achieve different effects for these items, one at a time.

    Of course, over time this is equally relevant to pages that are monolingual. Suppose your new corporate publishing guidelines change, and proclaim that bolding is better than italics for document names. With semantically marked up HTML, you can easily change a whole site with one tiny edit to the CSS. In the situation described above, however, you’d have to hunt through every page for relevant <i> tags and change them individually, so that you didn’t apply the same style change to emphasis and foreign words too.

  2. Allowing authors to use <b> and <i> tags is also problematic, in my mind, because it keeps authors thinking in presentational terms, rather than helping them move to properly semantic markup. At the very least, it blurs the ideas. To an author in a hurry, it is also tempting to just slap one of these tags on the text to make it look different, rather than to stop and think about things like consistency and future-proofing. (Yes, I’ve often done it too…)

I always forget how to get around the namespace issue when transforming XHTML files to XHTML using XSL, and it always takes ages for me to figure it out again. So I’m going to make a note here to remind me. This seems to work:

<?xml version="1.0" encoding="UTF-8"?>

<xsl:transform version="2.0"
xmlns="http://www.w3.org/1999/xhtml"
xmlns:html="http://www.w3.org/1999/xhtml"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:fn="http://www.w3.org/2005/02/xpath-functions"
xmlns:xdt="http://www.w3.org/2005/02/xpath-datatypes"
xmlns:saxon="http://icl.com/saxon"
exclude-result-prefixes="saxon fn xs xdt html">

<xsl:output method="xhtml" encoding="UTF-8"
doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN" indent="no" doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" />

Then you need to refer to elements in the source to be converted by using the html: namespace prefix, e.g. <xsl:template match="html:div">…</xsl:template>.

I always have to look up the template that copies everything not fiddled with in the other templates, too, so here it is, for good measure:

<xsl:template match="@*|node()">
	<xsl:copy>
		<xsl:apply-templates select="@*|node()"/>
		</xsl:copy>
	</xsl:template>

>> Use it !

Picture of the page in action.

Although I have a picker already for Arabic, Persian and Urdu, I have developed another that is specifically for inputting Urdu. One reason for this is to reduce the choice of characters so that the user is more likely to select the right character for Urdu (eg. heh goal rather than arabic heh). Another is to provide shortcuts for things like aspirated letters and some common combinations (like the word ‘allah’).

It includes characters used for Urdu in Unicode 5.0. Most of the characters in the Urdu standard UZT 1.01 are included.

The aspirated letters of the alphabet can be entered with a single click. Also, base characters with diacritics can be inserted into the text with a single click where NFC normalisation would produce a single precomposed character.

Letters of the alphabet are shown in alphabetic order at the top left, digits are in keypad order, and combining characters related to vowel sounds are shown along the bottom. The lower middle section contains useful but non-alphabetic characters and punctuation. To the right are various symbols. Hinting is implemented for visually similar glyphs.

>> Use it !

Picture of the page in action.

Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes a picker much more usable than a regular character map utility.

The Bengali picker includes all the characters in the Unicode 5.0 Bengali block. Note: There was an important addition to the Bengali block in version 4.1, a single character for khanda ta, that may not yet be supported in fonts, but has been added to this version of the picker.

Consonants are mostly in a typical articulatory arrangement, vowels are aligned with vowel signs, and digits are in keypad order. Hinting is implemented for visually similar glyphs.

A function has also been added to transliterate Bengali text to Latin, though the scheme used is not standard, and may change at short notice. Don’t use it in anger yet.

I’ve been wanting to improve the editing behaviour of my pickers for quite some time, so that users could interact more easily with the keyboard, and insert characters into the middle of a composition, not just at the end. In fact, the output area maintains the focus all the time, now - which makes a major improvement to the usability of the pickers.

This week I made those things happen, and created a new template with some other changes, too.

An updated Bengali picker is first out of the box, but look out for a brand new Urdu-specific picker to follow close on its heels. I will retrofit the new template to other pickers as time allows, or need dictates.

I also beefed up the font selection list with a large number of TT and OT fonts, and improved the reference material at the bottom.

I improved the mechanism that highlights similar characters, to give more fine-grained control to the associations between characters.

I also added a field just under the title that gives information about the character the user is mousing over, and added a search field to help users find characters for which they know the Unicode name or number. I plan to extend the information associated with characters in future to include native names (eg. e-kar) and other useful search info.

I also changed the scripting and HTML so that a single click can now produce multiple characters in the composition field. This will allow users to input ligatures like the indic ‘ksha’ or Urdu aspirated consonants, or complex sequences tied to ligatures (like the word ‘Allah’) with a simple click.
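The two behaviours described above, inserting in the middle of the composition and inserting several characters with one click, can be sketched together. This is an illustration rather than the pickers' code: the name insertAtCaret is invented, and it assumes a field that exposes selectionStart and selectionEnd, which not every browser of the day supports.

```javascript
// Sketch only: insert a string (one or more characters) at the caret,
// replacing any selected text, then move the caret to just after it.
function insertAtCaret(field, text) {
  const start = field.selectionStart;
  const end = field.selectionEnd;
  field.value = field.value.slice(0, start) + text + field.value.slice(end);
  const caret = start + text.length;
  field.selectionStart = caret;
  field.selectionEnd = caret;
  return field.value;
}

// A plain object stands in for a textarea here, with the caret between
// two Bengali letters. One "click" inserts the three-character sequence
// KA + VIRAMA + SSA (the ksha ligature).
const composition = { value: "\u0995\u0996", selectionStart: 1, selectionEnd: 1 };
insertAtCaret(composition, "\u0995\u09CD\u09B7");
```

Afterwards the field holds five characters and the caret sits at position 4, directly after the inserted sequence.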

Some things have also been removed. There is no DEL button now, since you can interact more easily with the keyboard for that. Spaces are available from the (now rationalised) character area, rather than a button. And there is no longer an option to switch between graphics and characters for the selection. This is partly for simplicity, and partly to make it easier to represent some of the slightly more complicated selection options I want to add in future - for example, specific shapes are appropriate for Urdu arabic characters, and I don’t want to leave it to chance as to whether the user’s system has the right fonts to produce the desired shapes.

Getting to this actually required a huge amount of unseen work, since I had to wrap all the images in button markup and move and change attributes, etc. so that the composition box retains the focus in IE (it worked fine for Firefox, Opera and Safari). I also, of course, made significant, but probably not noticeable, changes to the JavaScript and CSS.

I just read a post by Ivan Herman about how Hungary has joined the Schengen Agreement, and will soon be removing border controls on the EU side. That put me in mind of the first time I tried to pass through the Iron Curtain.

I was travelling from Vienna to Budapest (probably about 25 years ago) and I had decided to go through Sopron, rather than Hegyeshalom, so I could see something a little more off the beaten track. This was a month-long InterRail trip, so I was able to follow my whim and jump on whatever train I wanted. The train connections worked, and I found myself heading south from Vienna.

Eventually, the train passed into Hungary and stopped. I needed a visa, so I got off with a bunch of other (Hungarian looking) people, and traipsed over to a small outbuilding, where I found myself at the back of a queue of people jostling bags of various sizes and dressed and coiffed in what looked to me to be a very Eastern European fashion. Looking out of the window, everything was grey. I could see rail tracks and points and small, grey buildings but also several very tall towers with machine gun nests perched on top (quite large looking machine guns). The queue moved slowly, and I was surprised at one point to see my train pulling away and disappearing. It seemed a bit odd (and I was glad I’d brought all my stuff with me), but I figured this was probably normal, and I’d just have to catch another train.

I finally arrived at the desk and asked for a visa. The guy behind the desk started talking to me in a somewhat animated fashion, but I had no idea what he was saying. I hadn’t learned German yet, and Hungarian was completely incomprehensible to me. I kept trying to explain, politely, in English, that I needed a visa. Finally, he gave me an exasperated look and called someone out of a nearby room. The guy who emerged was huge, bald and intimidatingly business-like. (Some time later I saw the film Midnight Express, and realised that the prison guard and he could have been the same person.) He shouted at me “Nicht visa!”. And I tried to explain, in English, that, yes, I had no visa, but would like to obtain one, please. This didn’t appear to get across clearly, because he simply repeated “Nicht visa!!” several times, increasing in volume.

Finally, the tension broke and gave way to action. He motioned for me to follow him out of the building, and we started walking away across a couple of sets of railway tracks. I noticed, feeling slightly less at ease but still hopeful, that I was flanked by a soldier with a gun on either side. They weren’t exactly giving me encouraging looks, and as I glanced up at the machine gun towers and at the surrounding barbed wire, I began to wish I knew what was happening.

Soon we arrived at the end of a short train. The very last carriage of this train looked like something you’d expect to see in a Wild West film. It had a kind of standing area at each end with a railing, a door into the carriage and steps leading down to the ground on either side. I was ushered up one set of steps and into what turned out to be an empty carriage. The door was shut behind me, and within a minute or so, as I remember it, the train started moving off, in the same direction my earlier train had disappeared. So I wasn’t just being sent back across the border.

That last realisation started to trouble me a little, since I still had no visa and no idea what was happening. It didn’t help that there was a small round window in the door at each end of the carriage, through which I could see guards sitting on the steps at each corner, all holding machine guns at the ready. As the towers slid away behind us, night started to fall.

Twenty-five years have dulled the memory of some of what happened next, but eventually I got off at a small station, having reached the end of the line. The guards were gone, and the station turned out to be quite modern and clean looking. I still couldn’t understand anything anyone was saying, so I still had no idea where I was, but I was able to figure out that I was somehow back in Austria. It was much later that I was to realise that Sopron is on a peninsula that sticks into Austria, and I had come in one side and been sent out the other.

I slept that night on the floor of the main station building, and the next morning set off to find someone who spoke English and could tell me where I was - and just as importantly how to get into Hungary. The town was quite small, maybe just a village. In spite of that it took me a while, but I eventually came across a chap in a supermarket who was able to explain to me that visas are not issued on entry into Hungary by train via Sopron. I was ahead of him there. He also offered to drive me to the border, telling me that I would be able to get a visa at the road entry point.

It’s nice to think about that person whenever I relive this story. He really went out of his way, leaving work to assist a complete stranger, with no fuss or thought for reward. I wonder whether he remembers me. I doubt it. Of course, these days he may even be reading this blog post…

So it was that, eventually, I got the stamp in my passport that I needed, and somehow found my way onto another train heading for Budapest. Well, it wasn’t quite the end of the fun. That continued when I tried to meet up with my father in the capital. But that, as they say, is another story…

I was in Prague for a face-to-face meeting of the ITS (International Tag Set) Working Group in October, and I finally created a couple of sets of photos of the town that I took in my free time.

It’s such a photogenic place, you can almost point the camera in any direction and click the shutter and come away with interesting photos. I didn’t have a huge amount of time, so I stayed in the main tourist track, although this time I did make it up three towers - definitely a good move.


There are 2 new sets of photos:

  • East of the Vltava This covers the Staré Město (old town), including photos from the top of the clock tower and the top of Prašná brána (Powder Gate).
  • West of the Vltava Malá Strana and the Pražský hrad complex, with views back over the city from the top of the cathedral tower.

I was in Hyderabad, India in January for a workshop exploring internationalisation issues surrounding the Speech Synthesis Markup Language (SSML), and we got a little time to take photos. (I’m reinstating this post because I accidentally deleted it :( )

There are 2 sets of photos:

  • Charminar etc Charminar is a monument at the centre of Hyderabad. This set also includes photos at the nearby Mecca Masjid, and some tribal dancing.
  • Golconda Fort High on a hill overlooking Hyderabad, there has been a fort here for nearly a thousand years.


Tim Greenwood just pointed out to me a ‘bug’ in my converter program, which I think is actually a bug in Firefox (although I imagine it was implemented by someone as a feature).

If you type A0 (the hex code for a non-breaking space) in the Hexadecimal code points field, then press Convert, you will get a blank space in the Characters field that should be U+00A0 NO-BREAK SPACE. Then press Convert or View Names above this Characters field and you’ll find that what was supposed to be a NBSP has changed into an ordinary space. IE7, Opera and Safari all continue to show the character in the field as a NBSP.

(However, all four browsers substitute an ordinary space when you copy and paste the text from the Characters field into something else.)

I tried this with a range of other types of space, but saw no such behaviour (try it). They all remained themselves.
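For anyone who wants to check which character really ended up in a field, a round trip between hex code points and characters is enough. The sketch below is a simplified stand-in for the converter, with made-up function names, not its actual code.

```javascript
// Sketch only: convert space-separated hex code points to characters and back.
function hexToChars(hexList) {
  return hexList.trim().split(/\s+/)
    .map((h) => String.fromCodePoint(parseInt(h, 16)))
    .join("");
}

function charsToHex(text) {
  return Array.from(text)
    .map((ch) => ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"))
    .join(" ");
}

// If nothing interferes, a no-break space survives the round trip.
const roundTrip = charsToHex(hexToChars("A0"));
```

Here roundTrip is "00A0". If the field's contents have been silently replaced by an ordinary space, running charsToHex on them gives "0020" instead.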

Anyone know what this is about?




Blue Beanie Day

Originally uploaded by r12a

Monday, November 26, 2007 is the day thousands of Standardistas (people who support web standards) will wear a Blue Beanie to show their support for accessible, semantic, and hopefully internationalized web content.

I haven’t got a blue hat, so I cheated a little by borrowing bits of the cover of Jeffrey Zeldman’s great book, “Designing with Web Standards”. That’s me under the hat though.

(If you’re wondering, the text on the left says the same as the text top right, in Arabic, Urdu, Inuktitut, Simplified Chinese, Traditional Chinese, Kazakh, Greek, Dzongkha, Ethiopian, Hebrew, Hindi, Nepali, Japanese, Korean, Hungarian, Punjabi, Thai and Venda.)

See the Flickr pool.


The word Mandalay in Myanmar script.

I’ve been brushing up on the Myanmar script, since major changes are on the way with Unicode 5.1.

I upgraded my Myanmar picker to handle the new characters, and I edited my notes on how the script works.

The new characters will make a big difference to how you author text in Unicode, and people will need to update currently existing pages to bring them in line with the new approach. The changes should make it much easier to create content in Burmese, in addition to addressing some niggly problems with making the script work correctly. One reason the changes were sanctioned is that there is currently very little Burmese content out there in Unicode.

I’ll be updating my character by character notes later too.

The only problem with all this is that existing fonts will all need to be changed to support the new world order (or myanmar order). I found one font that is already 5.1 ready from the Myanmar Unicode & NLP Research Center. So if you don’t want to download that font, you’ll need to read the PDF version of my notes on the script.

That would be a pity, however, since I had some fun adding JavaScript to the article today, so that it displays a breakdown, character by character, of each example as you mouse over it (using images, so you see it properly).
