Dochula Pass, Bhutan

The language subtag lookup tool now has links to Wikipedia search for all languages and scripts listed. This helps with finding information about languages, now that SIL limits access to the Ethnologue, and it offers a new source of information for the scripts listed.

Picture of the page in action.
>> Use the subtag lookup tool

These are just some notes for future reference. The following scripts in Unicode 9.0 are normally written from right to left.

Scripts containing characters with the bidirectional property Right-to-Left Arabic (AL) are marked with an asterisk. The remaining scripts contain characters with the property Right-to-Left (R):

In modern use

Adlam
Arabic *
Hebrew
Nko
Syriac *
Thaana *

Limited modern use

Mende Kikakui (small numbers)
Old Hungarian
Samaritan (religious)

Archaic

Avestan
Cypriot
Hatran
Imperial Aramaic
Kharoshthi
Lydian
Manichaean
Meroitic
Mandaic
Nabataean
Old South Arabian
Old North Arabian
Old Turkic
Pahlavi (Inscriptional)
Palmyrene
Parthian (Inscriptional)
Phoenician

Picture of the page in action.
>> Use the converter

An updated version of the Unicode Character Converter web app is now available. This app allows you to convert characters between various different formats and notations.

Significant changes include the following:

  • It’s now possible to generate ECMAScript 6 style escapes for supplementary characters in the JavaScript output field, eg. \u{10398} rather than \uD800\uDF98. (A sketch of the two escape forms follows this list.)
  • In many cases, clicking on a checkbox option now applies the change straight away if there is content in the associated output field. (There are 4 output fields where this doesn’t happen because we aren’t dealing with escapes and there are problems with spaces and delimiters.)
  • By default, the JavaScript output no longer escapes the ASCII characters that can be represented by \n, \r, \t, \' and \". A new checkbox is provided to force those transformations if needed. This should make the JS transform much more useful for general conversions.
  • The code to transform to HTML/XML can now replace RLI, LRI, FSI and PDI if the Convert bidi controls to HTML markup option is set.
  • The code to transform to HTML/XML can convert many more invisible or ambiguous characters to escapes if the Escape invisible characters option is set.
  • UTF-16 code units are now always at least 4 digits long.
  • Fixed a bug related to U+00A0 when converting to HTML/XML.
  • The order of the output fields was changed, and various small improvements were made to the user interface.
  • Revamped and updated the notes.
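For anyone curious about the first item above, here is a minimal sketch (not the converter’s actual code) of how the ES6-style escape compares with the older surrogate-pair form for a supplementary character:

    // Minimal sketch: an ES6-style escape versus a surrogate-pair escape
    // for a supplementary character (illustrative, not the converter's code).
    function es6Escape(codePoint) {
      return '\\u{' + codePoint.toString(16).toUpperCase() + '}';
    }

    function surrogatePairEscape(codePoint) {
      const offset = codePoint - 0x10000;
      const high = 0xD800 + (offset >> 10);
      const low = 0xDC00 + (offset & 0x3FF);
      return '\\u' + high.toString(16).toUpperCase() +
             '\\u' + low.toString(16).toUpperCase();
    }

    console.log(es6Escape(0x10398));           // \u{10398}
    console.log(surrogatePairEscape(0x10398)); // \uD800\uDF98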

Many thanks to the people who wrote in with suggestions.

Picture of the page in action.
>> Use UniView

UniView now supports Unicode version 9, which is being released today, including all changes made during the beta period. (As before, images are not available for the Tangut additions, but the character information is available.)

This version of UniView also introduces a new filter feature. Below each block or range of characters is a set of links that allows you to quickly highlight characters with the property letter, mark, number, punctuation, or symbol. For more fine-grained property distinctions, see the Filter panel.

In addition, for some blocks there are other links available that reflect tags assigned to characters. This tagging is far from exhaustive! For instance, clicking on sanskrit will not show all characters used in Sanskrit.

The tags are just intended to be an aid to help you find certain characters quickly by exposing words that appear in the character descriptions or block subsection titles. For example, if you want to find the Bengali currency symbol while viewing the Bengali block, click on currency and all other characters but those related to currency will be dimmed.

(Since the highlight function is used for this, don’t forget that, if you happen to highlight a useful subset of characters and want to work with just those, you can use the Make list from highlights command, or click on the upwards pointing arrow icon below the text area to move those characters into the text area.)

Picture of the page in action.
>> See the chronology
>> See the maps

This blog post introduces the first of a set of historical maps of Europe that can be displayed at the same scale so that you can compare political or ethnographic boundaries from one time to the next. The first set covers the period from 362 AD to 830 AD.

A key aim here is to allow you to switch from map to map and see how boundaries evolve across an unchanging background.

The information in the maps is derived mostly from Colin McEvedy’s excellent series of books, in particular (so far) The New Penguin Atlas of Medieval History, but it sometimes also brings in information from the Times History of Europe. Boundaries are approximate for a number of reasons: first, especially in the earlier times, the borders were only approximate anyway; second, I have deduced the boundary information from small-scale maps and (so far) only a little additional research; and third, the sources sometimes differ about where boundaries lay. I hope to refine the data during future research; in the meantime, take this information as grosso modo.

The link below the picture takes you to a chronological summary of events that lie behind the changes in the maps. Click on the large dates to open maps in a separate window. (Note that all maps will open in that window, and you may have to ensure that it isn’t hidden behind the chronology page.)

The background to the SVG overlay is a map that shows relief and rivers, as well as modern country boundaries (the dark lines). These are things that, as good as McEvedy’s maps are, I always found myself missing when looking for useful reference points. Since the outlines and text are created in SVG, you can zoom in to see details.

This is just the first stage, and the maps are still largely first drafts. The plan is to refine the details for existing maps and add many more. So far we only deal with Europe. In the future I’d like to deal with other places, if I can find sources.

Picture of the page in action.
>> Use UniView

UniView now supports the characters introduced for the beta version of Unicode 9. Any changes made during the beta period will be added when Unicode 9 is officially released. (Images are not available for the Tangut additions, but the character information is available.)

It also brings in notes for individual characters where those notes exist, if Show notes is selected. These notes are not authoritative, but are provided in case they prove useful.

A new icon was added below the text area to add commas between each character in the text area.

Links to the help page that used to appear on mousing over a control have been removed. Instead, there is now a noticeable blue link to the help page, and the help page has been reorganised and uses image maps so that it is easier to find information. The reorganisation puts more emphasis on learning by exploration rather than learning by reading.

Various tweaks were made to the user interface.

Picture of the page in action.
>> Use the picker

I’ve been doing more work over the weekend.

The data behind the keyword search has now been completely updated to reflect descriptions by Gardiner and Allen. If you work with those lists it should now be easy to locate hieroglyphs using keywords. The search mechanism has also been rewritten so that you don’t need to type keywords in a particular order for them to match. I also strip out various common function words and do some other optimisation before attempting a match.
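To give a feel for what the new matching does, here is a simplified sketch in JavaScript; the stop-word list and function names are invented for the example, and the picker’s own code and data differ:

    // Simplified sketch of order-independent keyword matching, with common
    // function words stripped before the match is attempted.
    const STOP_WORDS = new Set(['a', 'an', 'the', 'of', 'with', 'and']);

    function matchesKeywords(query, description) {
      const terms = query.toLowerCase().split(/\s+/)
        .filter(word => word && !STOP_WORDS.has(word));
      const target = description.toLowerCase();
      return terms.every(term => target.includes(term));
    }

    // Order doesn't matter: both of these return true.
    matchesKeywords('palm ox', 'ox horns with stripped palm branch');
    matchesKeywords('ox palm', 'ox horns with stripped palm branch');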

The other headline news is the addition of various controls above the text area, including one that will render MdC text as a two-dimensional arrangement of hieroglyphs. To do this, I adapted WikiHiero’s PHP code to run in JavaScript. You can see an example of the output in the picture attached to this post. If you want to try it, the MdC text to put in the text area is:
anx-G5-zmA:tA:tA-nbty-zmA:tA:tA-sw:t-bit:t-< -zA-ra:.-mn:n-T:w-Htp:t*p->-anx-D:t:N17-!

The result should look like this:

Picture of hieroglyphs.

Other new controls allow you to convert MdC text to hieroglyphs, and vice versa, or to type in a Unicode phonetic transcription and find the hieroglyphs it represents. (This may still need a little more work.)

I also moved the help text from the notes area to a separate file, with a nice clickable picture of the picker at the top that will link to particular features. You can get to that page by clicking on the blue Help box near the bottom of the picker.

Finally, you can now set the text area to display characters from right to left, in right-aligned lines, using more controls > Output direction. Unfortunately, I don’t know of a font that, under these conditions, will flip the hieroglyphs horizontally so that they face the right way.

For more information about the new features, and how to use the picker, see the Help page.

Picture of the page in action.
>> Use the picker

Over the weekend I added a set of new features to the picker for Egyptian Hieroglyphs, aimed at making it easier to locate a particular hieroglyph. Here is a run-down of various methods now available.

Category-based input

This was the original method. Characters are grouped into standard categories. Click on one of the orange characters, chosen as a nominal representative of the class, to show below all the characters in that category. Click on one of those to add it to the output box. As you mouse over the orange characters, you’ll see the name of the category appear just below the output box.

Keyword-search-based input

The app associates most hieroglyphs with keywords that describe the glyph. You can search for glyphs using those keywords in the input field labelled Search for.

Searching for ripple will match both ripple and ripples. Searching for king will match king and walking. If you want to match only whole words, surround the search term with colons, ie. :ripple: or :king:.

Note that the keywords are written in British English, so you need to look for sceptre rather than scepter.

The search input is treated as a regular expression, so if you want to search for two words that may have other words between them, use .*. For example, ox .* palm will match ox horns with stripped palm branch.
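To make that concrete, here is a rough sketch of how such a search might be applied; it maps the :word: convention to regular-expression word boundaries (illustrative only – the picker’s own code will differ):

    // Rough sketch: treat the search term as a regular expression, turning
    // leading and trailing colons into word boundaries.
    function searchKeywords(term, keywordList) {
      const pattern = term.replace(/^:/, '\\b').replace(/:$/, '\\b');
      const re = new RegExp(pattern, 'i');
      return keywordList.filter(keywords => re.test(keywords));
    }

    searchKeywords(':king:', ['king', 'walking']);                        // ['king']
    searchKeywords('ox .* palm', ['ox horns with stripped palm branch']); // one match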

Many of the hieroglyphs have also been associated with keywords related to their use. If you select Include usage, these keywords will also be searched. Note that this keyword list is not exhaustive by any means, but it may occasionally be useful. For example, a search for Anubis will produce 𓁢 𓃢 𓃣 𓃤.

(Note: to search for a character based on the Unicode name for that character, eg. w004, use the search box in the yellow area.)

Searching for pronunciations

Many of the hieroglyphs are associated with 1, 2 or 3 consonant pronunciations. These can be looked up as follows.

Type the sequence of consonants into the output box and highlight them. Then click on Look up from Latin. Hieroglyphs that match that character or sequence of characters will be displayed below the output box, and can be added to the output box by clicking on them. (Note that if you still have the search string highlighted in the output box those characters will be replaced by the hieroglyph.)

You will find the panel Latin characters useful for typing characters that are not accessible via your keyboard. The panel is displayed by clicking on the higher L in the grey bar to the left. Click on a character to add it to the output area.

For example, if you want to obtain the hieroglyph 𓎝, which is represented by the 3-character sequence wꜣḥ, add wꜣḥ to the output area and select it. Then click on Latin characters. You will see the character you need just above the SPACE button. Click on that hieroglyph and it will replace the wꜣḥ text in the output area. (Unhighlight the text in the output area if you want to keep both and add the hieroglyph at the cursor position.)

Input panels accessed from the vertical grey bar

The vertical grey bar to the left allows you to turn on/off a number of panels that can help create the text you want.

Latin characters. This panel displays Latin characters you are likely to need for transcription. It is particularly useful for setting up a search by pronunciation (see above).

Latin to Egyptian. This panel also displays Latin characters used for transcription, but when you click on them they insert hieroglyphs into the output area. These are 24 hieroglyphs represented by a single consonant. Think of it as a shortcut if you want to find 1-consonant hieroglyphs by pronunciation.

Where a single consonant can be represented by more than one hieroglyph, a small pop-up will present you with the available choices. Just click on the one you want.

Egyptian alphabet. This panel displays the 26 hieroglyphs that the previous panel produces as hieroglyphs. In many cases this is the quickest way of typing in these hieroglyphs.

Picture of the page in action.
>> Use the picker

I have just published a picker for Egyptian Hieroglyphs.

This Unicode character picker allows you to produce or analyse runs of Egyptian Hieroglyph text.

Characters are grouped into standard categories. Click on one of the orange characters, chosen as a nominal representative of the class, to show below all the characters in that category. Click on one of those to add it to the output box. As you mouse over the orange characters, you’ll see the name of the category appear just below the output box.

Just above the orange characters you can find buttons to insert RLO and PDF controls. RLO makes the characters that follow it progress from right to left. Alternatively, you can select more controls > Output direction to set the direction of the output box to RTL/LTR override. The latter approach will align the text to the right of the box. I haven’t yet found a Unicode font that also flips the glyphs horizontally as a result. I’m not entirely sure about the best way to apply directionality to Egyptian hieroglyphs, so I’m happy to hear suggestions.
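If you want the same effect outside the picker, the buttons are simply inserting the Unicode control characters: RLO is U+202E and PDF is U+202C. A tiny sketch:

    // Tiny sketch: force right-to-left progression for a run of hieroglyphs
    // by wrapping it in RLO (U+202E) ... PDF (U+202C). Whether the glyphs are
    // mirrored depends entirely on the font.
    const RLO = '\u202E';
    const PDF = '\u202C';
    const rtlRun = RLO + '𓀀𓅃𓆣𓁿' + PDF;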

Alongside the direction controls are some characters used for markup in the Manuel de Codage, which allows you to prepare text for an engine that knows how to lay it out two-dimensionally. (The picker doesn’t do that.)

The Latin Characters panel, opened from the grey bar to the left, provides characters needed for transcription.

In case you’re interested, here is the text you can see in the picture. (You’ll need a font to see this, of course. Try the free Noto Sans font, if you don’t have one – or copy-paste these lines into the picker, where you have a webfont.)
𓀀𓅃𓆣𓁿
<-i-mn:n-R4:t*p->
𓍹𓇋-𓏠:𓈖-𓊵:𓏏*𓊪𓍺

The last two lines spell the name of Amenhotep using Manuel de Codage markup, according to the Unicode Standard (p 432).

I just received a query from someone who wanted to know how to figure out what characters are in and what characters are not in a particular legacy character encoding. So rather than just send the information to her I thought I’d write it as a blog post so that others can get the same information. I’m going to write this quickly, so let me know if there are parts that are hard to follow, or that you consider incorrect, and I’ll fix it.

A few preliminary notes to set us up: When I refer to ‘legacy encodings’, I mean any character encoding that isn’t UTF-8. Though, actually, I will only consider those that are specified in the Encoding spec, and I will use the data provided by that spec to determine what characters each encoding contains (since that’s what it aims to do for Web-based content). You may come across other implementations of a given character encoding, with different characters in it, but bear in mind that those are unlikely to work on the Web.

Also, the tools I will use refer to a given character encoding using the preferred name. You can use the table in the Encoding spec to map alternative names to the preferred name I use.

What characters are in encoding X?

Let’s suppose you want to know what characters are in the character encoding you know as cseucpkdfmtjapanese. A quick check in the Encoding spec shows that the preferred name for this encoding is euc-jp.

Go to http://r12a.github.io/apps/encodings/ and look for the selection control near the bottom of the page labelled show all the characters in this encoding.

Select euc-jp. It opens a new window that shows you all the characters.

picture of the result

This is impressive, but the list is so large that it’s not as useful as it could be.

So highlight and copy all the characters in the text area and go to https://r12a.github.io/apps/listcharacters/.

Paste the characters into the big empty box, and hit the button Analyse characters above.

This will now list those same characters for you, organised by Unicode block. At the bottom of the page it gives a total character count and the number of Unicode blocks involved.
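Conceptually, the grouping step works something like the following sketch; only two block ranges are shown here, whereas the app uses the full Unicode block data:

    // Minimal sketch of grouping characters by Unicode block; only two block
    // ranges are included, whereas the app uses the complete block list.
    const BLOCKS = [
      { name: 'Basic Latin', start: 0x0000, end: 0x007F },
      { name: 'Greek and Coptic', start: 0x0370, end: 0x03FF },
    ];

    function groupByBlock(text) {
      const groups = {};
      for (const ch of text) {
        const cp = ch.codePointAt(0);
        const block = BLOCKS.find(b => cp >= b.start && cp <= b.end);
        const name = block ? block.name : 'Other';
        (groups[name] = groups[name] || []).push(ch);
      }
      return groups;
    }

    groupByBlock('AΔa'); // { 'Basic Latin': ['A', 'a'], 'Greek and Coptic': ['Δ'] }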

picture of the result

What characters are not in encoding X?

If, instead, you want to know which characters from a given Unicode block are not in the encoding, you can follow these steps.

Go to UniView (http://r12a.github.io/uniview/) and select the block you are interested in where it says Show block, or alternatively type the range into the control labelled Show range (eg. 0370:03FF).

Let’s imagine you are interested in Greek characters and you have therefore selected the Greek and Coptic block (or typed 0370:03FF in the Show range control).

On the edit buffer area (top right) you’ll see a small icon with an arrow pointing upwards. Click on this to bring all the characters in the block into the edit buffer area. Then hit the icon just to its left to highlight all the characters, and copy them to the clipboard.

picture of the result

Next open http://r12a.github.io/apps/encodings/ and paste the characters into the input area labelled with Unicode characters to encode, and hit the Convert button.

picture of the result

The Encoding converter app will list all the characters in a number of encodings. If the character is part of the encoding, it will be represented as two-digit hex codes. If not, and this is what you’re looking for, it will be represented as decimal HTML escapes (eg. &#880;). This way you can get the decimal code point values for all the characters not in the encoding. (If all the characters exist in the encoding, the block will turn green.)
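The decimal escape form is simply the character’s code point expressed in decimal, so as a rough sketch you could produce it yourself like this:

    // Sketch: a character the encoder cannot handle is shown as a decimal
    // HTML character escape, i.e. its code point in decimal.
    function decimalEscape(ch) {
      return '&#' + ch.codePointAt(0) + ';';
    }

    decimalEscape('\u0370'); // '&#880;' (GREEK CAPITAL LETTER HETA)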

(If you want to see the list of characters, copy the results for the encoding you are interested in, go back to UniView and paste the characters into the input field labelled Find. Then click on Dec. Ignore all ASCII characters in the list that is produced.)

Note, by the way, that you can tailor the encodings that are shown by the Encoding converter by clicking on change encodings shown and then selecting the encodings you are interested in. There are 36 to choose from.

Picture of the page in action.
>> Use the picker

Following closely on the heels of the Old Norse and Runic pickers comes a new Old English (Anglo-Saxon) picker.

This Unicode character picker allows you to produce or analyse runs of Old English text using the Latin script.

In addition to helping you to type Old English Latin-based text, the picker allows you to automatically generate phonetic and runic transcriptions. These should be used with caution! The transcriptions are only intended to be a rough guide, and there may occasionally be slight inaccuracies that need patching.

The picture in this blog post shows examples of Old English text, and phonetic and runic transcriptions of the same, from the beginning of Beowulf. Click on it to see it larger, or copy-paste the following into the picker and try out the commands on the top right: Hwæt! wē Gār-Dena in ġēar-dagum þēod-cyninga þrym gefrūnon, hū ðā æþelingas ellen fremedon.

If you want to work more with runes, check out the Runic picker.

Picture of the page in action.
>> Use the picker

Character pickers are especially useful for people who don’t know a script well, as characters are displayed in ways that aid identification. These pickers also provide tools to manipulate the text.

The Runic character picker allows you to produce or analyse runs of Runic text. It allows you to type in runes for the Elder fuþark, Younger fuþark (both long-branch and short-twig variants), the Medieval fuþark and the Anglo-Saxon fuþork. To help beginners, each of the above has its own keyboard-style layout that associates the runes with characters on the keyboard to make it easier to locate them.

It can also produce a Latin transliteration for a sequence of runes, or automatically produce runes from a Latin transliteration. (Note that these transcriptions do not indicate pronunciation – they are standard Latin substitutes for graphemes, rather than actual Old Norse or Old English, etc., text. To convert Old Norse to runes, see the description of the Old Norse picker below. This will soon be joined by another picker which will do the same for Anglo-Saxon runes.)
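To give a flavour of what the rune transcription involves, here is a heavily simplified sketch covering just a few letters; the mapping shown is illustrative, and the picker’s own tables are far more complete and fuþark-specific:

    // Heavily simplified sketch of Latin-to-rune substitution for a handful
    // of letters; real transcription needs many more rules.
    const LATIN_TO_RUNE = {
      'f': '\u16A0', // ᚠ
      'u': '\u16A2', // ᚢ
      'þ': '\u16A6', // ᚦ
      'r': '\u16B1', // ᚱ
    };

    function toRunes(text) {
      return [...text.toLowerCase()].map(ch => LATIN_TO_RUNE[ch] || ch).join('');
    }

    toRunes('fuþ'); // 'ᚠᚢᚦ'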

Writing in runes is not an exact science. Actual runic text is subject to many variations dependent on chronology, location and the author’s idiosyncrasies. It should be particularly noted that the automated transcription tools provided with this picker are intended as aids to speed up transcription, rather than to produce absolutely accurate renderings of specific texts. The output may need to be tweaked to produce the desired results.

You can use the RLO/PDF buttons below the keyboard to make the runic text run right-to-left, eg. ‮ᚹᚪᚱᚦᚷᚪ‬, and, if you have the right font (such as Junicode, which is included as the default webfont, or a BabelStone font), make the glyphs face to the left as well. The BabelStone fonts also implement a number of bind-runes for Anglo-Saxon (but are missing those for Old Norse) if you put a ZWJ character between the characters you want to ligate. For example: ᚻ‍ᛖ‍ᛚ. You can also produce two glyphs mirrored around the central stave by putting ZWJ between two identical characters, eg. ᚢ‍ᚢ. (Click on the picture of the picker in this blog post to see examples.)

Picture of the page in action.
>> Use the picker

The Old Norse picker allows you to produce or analyse runs of Old Norse text using the Latin script. It is based on a standardised orthography.

In addition to helping you to type Old Norse Latin-based text, the picker allows you to automatically generate phonetic and runic transcriptions. These should be used with caution! The phonetic transcriptions are only intended to be a rough guide, and, as mentioned earlier, real-life runic text is often highly idiosyncratic, not to mention that it varies depending on the time period and region.

The runic transcription tools in this app produce runes of the Younger fuþark – used for Old Norse after the Elder and before the Medieval fuþarks. This transcription tool has its own idiosyncrasies that may not always match real-life usage of runes. One particular idiosyncrasy is that the output always conforms to the same set of rules; others include the decision not to remove homorganic nasals before certain following letters. More information about this is given in the notes.

You can see an example of the output from these tools in the picture of the Old Norse picker that is attached to this blog post. Here’s some Old Norse text you can play with: Ok sem leið at jólum, gørðusk menn þar ókátir. Bǫðvarr spurði Hǫtt hverju þat sætti; hann sagði honum at dýr eitt hafi komit þar tvá vetr í samt, mikit ok ógurligt.

The picker also has a couple of tools to help you work with A New Introduction to Old Norse.

Picture of the page in action.
>> Use the app

This app allows you to see how Unicode characters are represented as bytes in various legacy encodings, and vice versa. You can customise the encodings you want to experiment with by clicking on change encodings shown. The default selection excludes most of the single-byte encodings.

The app provides a way of detecting the likely encoding of a sequence of bytes if you have no context, and also allows you to see which encodings support specific characters. The list of encodings is limited to those described for use on the Web by the Encoding specification.

The algorithms used are based on those described in the Encoding specification, and thus describe the behaviour you can expect from web browsers. The transforms may not be the same as for other conversion tools. (In some cases the browsers may also produce a different result than shown here, while the implementation of the spec proceeds. See the tests.)

Encoding algorithms convert Unicode characters to sequences of double-digit hex numbers that represent the bytes found in the target character encoding. A character that cannot be handled by an encoder will be represented as a decimal HTML character escape.

Decoding algorithms take the byte codes just mentioned and convert them to Unicode characters. The algorithm returns replacement characters where it is unable to map a given byte sequence to a character.

For the decoder input you can provide a string of hex numbers separated by space or by percent signs.
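If you want to reproduce the decoding side programmatically, browsers (and Node.js) expose the same Encoding-spec behaviour through TextDecoder. A minimal sketch, not the app’s own code:

    // Minimal sketch: decode legacy-encoded bytes using an Encoding-spec
    // label via TextDecoder (the same algorithms this app is based on).
    const bytes = Uint8Array.from([0xB0, 0xA1]);
    const text = new TextDecoder('euc-jp').decode(bytes); // '亜'

    // Note: TextEncoder only supports UTF-8, so encoding *to* a legacy
    // encoding needs mapping data such as that published with the spec.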

Green backgrounds appear behind sequences where all characters or bytes were successfully mapped to a character in the given encoding. Beware, however, that the character mapped to may not be the one you expect – especially in the single byte encodings.

To identify characters and look up information about them you will find UniView extremely useful. You can paste Unicode characters into the UniView Edit Buffer and click on the down-arrow icon below to find out what they are. (Click on the name that appears for more detailed information.) It is particularly useful for identifying escaped characters. Copy the escape(s) to the Find input area on UniView and click on Dec just below.

Picture of the page in action.
>> Use the picker

An update to version 17 of the Mongolian character picker is now available.

When you hover over or select a character in the selection area, the box to the left of that area displays the alternate glyph forms that are appropriate for that character. By default, this only happens when you click on a character, but you can make it happen on hover by clicking on the V in the grey selection bar to the right.

The list includes the default positional forms as well as the forms produced by following the character with a Free Variation Selector (FVS). The latter forms have been updated, based on work which has been taking place in 2015 to standardise the forms produced by using FVS. At the moment, not all fonts will produce the expected shapes for all possible combinations. (For more information, see Notes on Mongolian variant forms.)
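For reference, requesting a variant is just a matter of following the letter with one of the selectors (FVS1–FVS3 are U+180B–U+180D); a tiny sketch:

    // Tiny sketch: request a variant shape by following a Mongolian letter
    // with a Free Variation Selector; the shape shown depends on the font.
    const MONGOLIAN_LETTER_A = '\u1820';
    const FVS1 = '\u180B';
    const variant = MONGOLIAN_LETTER_A + FVS1;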

An additional new feature is that when the variant list is displayed, you can add an appropriate FVS character to the output area by simply clicking in the list on the shape that you want to see in the output.

This provides an easy way to check what shapes should be produced and what shapes are produced by a given font. (You can specify which font the app should use for display of the output.)

Some small improvements were also made to the user interface. The picker works best in Firefox and Edge desktop browsers, since they now have pretty good support for vertical text. It works least well in Safari (which includes the iPad browsers).

For more information about the picker, see the notes at the bottom of the picker page.

About pickers: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility. See the list of available pickers.

Picture of the page in action.
>> Use UniView

This update allows you to link to information about Han characters and Hangul syllables, and fixes some bugs related to the display of Han character blocks.

Information about Han characters displayed in the lower right area will have a link View data in Unihan database. As expected, this opens a new window at the page of the Unihan database corresponding to this character.

Han and Hangul characters also have a link View in PDF code charts (pageXX). On Firefox and Chrome, this will open the PDF file for that block at the page that lists the character. (For Safari and Edge you will need to scroll to the page indicated.) The PDF is useful if there is no picture or font glyph for that character, but it also allows you to see the variant forms of the character.

For some Han blocks, the number of characters per page in the PDF file varies slightly. In this case you will see the text approx; you may have to look at a page adjacent to the one you are taken to for these characters.

Note that some of the PDF files are quite large. If the file size exceeds 3MB, a warning is included.

Picture of the page in action.

>> Use UniView

Unicode 8.0.0 is released today. This new version of UniView adds the new characters encoded in Unicode 8.0.0 (including 6 new scripts). The scripts listed in the block selection menu were also reordered to match changes to the Unicode charts page.

The URL for UniView is now https://r12a.github.io/uniview/. Please change your bookmarks.

The github site now holds images for all 28,000+ Unicode codepoints other than Han ideographs and Hangul syllables (in two sizes).

I also fixed the Show Age filter, and brought it up to date.

Three bopomofo letters with tone mark.

Light tone mark in annotation.

A key issue for handling of bopomofo (zhùyīn fúhào) is the placement of tone marks. When bopomofo text runs vertically (either on its own, or as a phonetic annotation), some smarts are needed to display tone marks in the right place. This may also be required (though with different rules) for bopomofo when used horizontally for phonetic annotations (ie. above a base character), but not in all such cases. However, when bopomofo is written horizontally in any other situation (ie. when not written above a base character), the tone mark typically follows the last bopomofo letter in the syllable, with no special handling.

From time to time questions are raised on W3C mailing lists about how to implement phonetic annotations in bopomofo. Participants in these discussions need a good understanding of the various complexities of bopomofo rendering.

To help with that, I just uploaded a new Web page Bopomofo on the Web. The aim is to provide background information, and carry useful ideas from one discussion to the next. I also add some personal thoughts on implementation alternatives, given current data.

I intend to update the page from time to time, as new information becomes available.

Picture of the page in action.

Version 16 of the Bengali character picker is now available.

Other than a small rearrangement of the selection table, and the significant standard features that version 16 brings, this version adds the following:

  • three new buttons for automatic transcription between Latin and Bengali. You can use these buttons to transcribe to and from Latin transcriptions using the ISO 15919 or Radice approaches.
  • hinting to help identify similar characters.
  • the ability to select the base character for the display of combining characters in the selection table.

For more information about the picker, see the notes at the bottom of the picker page.

In addition, I made a number of additions and changes to Bengali script notes (an overview of the Bengali script), and Bengali character notes (an annotated list of characters in the Bengali script).

About pickers: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility. See the list of available pickers.

Picture of Tibetan initial letter styling.

The CSS WG needs advice on initial letter styling in non-Latin scripts, ie. enlarged letters or syllables at the start of a paragraph like those shown in the picture. Most of the current content of the recently published Working Draft, CSS Inline Layout Module Level 3 is about styling of initial letters, but the editors need to ensure that they have covered the needs of users of non-Latin scripts.

The spec currently describes drop, sunken and raised initial characters, and allows you to manipulate them using the initial-letter and the initial-letter-align properties. You can apply those properties to text selected by ::first-letter, or to the first child of a block (such as a span).

The editors are looking for

any examples of drop initials in non-western scripts, especially Arabic and Indic scripts.

I have scanned some examples from newspapers (so, not high quality print).

In the section about initial-letter-align the spec says:

Input from those knowledgeable about non-Western typographic traditions would be very helpful in describing the appropriate alignments. More values may be required for this property.

Do you have detailed information about initial letter styling in a non-Latin script that you can contribute? If so, please write to www-style@w3.org (how to subscribe).

I’m struggling to show combining characters on a page in a consistent way across browsers.

For example, while laying out my pickers, I want users to be able to click on a representation of a character to add it to the output field. In the past I resorted to pictures of the characters, but now that webfonts are available, I want to replace those with font glyphs. (That makes for much smaller and more flexible pages.)

Take the Bengali picker that I’m currently working on. I’d like to end up with something like this:

Picture of the desired result.

I put a no-break space before each combining character, to give it some width, and because that’s what the Unicode Standard recommends (p60, Exhibiting Nonspacing Marks in Isolation). The result is close to what I was looking for in Chrome and Safari except that you can see a gap for the nbsp to the left.
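In code terms the approach amounts to no more than this (a sketch of the string-building step, using a Bengali vowel sign as the example):

    // Sketch: give an isolated combining mark some width by prefixing a
    // no-break space (U+00A0), as the Unicode Standard suggests.
    function displayForm(combiningMark) {
      return '\u00A0' + combiningMark;
    }

    displayForm('\u09BE'); // NBSP + BENGALI VOWEL SIGN AA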

Picture of the result in Chrome and Safari.

But in IE and Firefox I get this:

Picture of the result in IE and Firefox.

This is especially problematic since it messes up the overall layout, but in some cases it also causes text to overlap.

I tried using a dotted circle Unicode character, instead of the no-break space. On Firefox this looked ok, but on Chrome it resulted in two dotted circles per combining character.

I considered using a consonant as the base character. It would work ok, but it would possibly widen the overall space needed (not ideal) and would make it harder to spot a combining character by shape. I tried putting a span around the base character to grey it out, but the various browsers reacted differently to the span. Vowel signs that appear on both sides of the base character no longer worked – the vowel sign appeared after. In other cases, the grey of the base character was inherited by the whole grapheme, regardless of the fact that the combining character was outside the span. (Here are some examples ে and ো.)

In the end, I settled for no preceding base character at all. The combining character was the first thing in the table cell or span that surrounded it. This gave the desired result for the font I had been using, albeit that I needed to tweak the occasional character with padding to move it slightly to the right.

On the other hand, this was not to be a complete solution either. Whereas most of the fonts I planned to use produce the dotted circle in these conditions, one of my favourites (SolaimanLipi) doesn’t produce it. This leads to significant problems, since many combining characters appear far to the left, and in some cases it is not possible to click on them, in others you have to locate a blank space somewhere to the right and click on that. Not at all satisfactory.

Picture of the result with the SolaimanLipi font.

I couldn’t find a better way to solve the problem, however, and since there were several Bengali fonts to choose from that did produce dotted circles, I settled for that as the best of a bad lot.

However, I then turned my attention to other pickers and tried the same solution. I found that only one of the many Thai fonts I tried for the Thai picker produced the dotted circles. So the approach here would have to be different. For Khmer, the main Windows font (DaunPenh) produced dotted circles only for some of the combining characters in Internet Explorer. And on Chrome, a sequence of two combining characters, one after the other, produced two dotted circles…

I suspect that I’ll need to choose an approach for each picker based on what fonts are available, and perhaps provide an option to insert or remove base characters before combining characters when someone wants to use a different font.

It would be nice to standardise behaviour here, and to do so in a way that involves the no-break space, as described in the Unicode Standard, or some other base character such as – why not? – the dotted circle itself. I assume that the fix for this would have to be handled by the browser, since there are already many font cats out of the bag.

Does anyone have an alternate solution? I thought I heard someone at the last Unicode conference mention some way of controlling the behaviour of dotted circles via some script or font setting…?

Update: See Marc Durdin’s blog for more on this topic, and his experiences while trying to design on-screen keyboards for Lao and other scripts.
