This document contains examples in another language or script.

An Introduction to Writing Systems:
A review of script characteristics affecting computer-based script support and Unicode

Front matter

Intended audience

Anyone who wants to better understand how scripts work in computerised environments, and more particularly with regard to Unicode. The material should be accessible for a wide audience, from software engineers to managers.

While the tutorial is perfectly accessible to beginners, it has also attracted very good reviews from people at an intermediate and advanced level, due to the breadth of scripts discussed. No previous knowledge is assumed.

Why should you read this?

When planning to introduce products into new markets it is important to understand the impact of having to support different scripts. The tutorial will make clear that this is not usually a trivial issue, and if you need to implement support, it may involve decisions at a very early stage in the design process.

This tutorial is particularly useful for people who are new to Unicode, in that it provides an overview of the basics in the context of real examples.

Objectives

The tutorial will provide you with an understanding of key requirements for implementing writing systems in information technology. It will do this by examining real examples of a wide range of modern scripts to discover features that a computerized implementation must support. It will also make special reference, where appropriate, to how the Unicode Standard points the way forward for meeting these requirements.

The tutorial does not provide detailed coding advice, but does provide the essential background information you need to understand the fundamental issues. It will also constitute an excellent orientation for newcomers to the topic, providing a wide-ranging framework that assists in assimilating further, more detailed and specific information.

Naturally, given the tutorial format this is an ambitious approach, and it will mean that we cannot go into great detail on any particular topic. If you would like to understand a topic better, there are a couple of excellent resources cited at the end of the tutorial, one of which is the very readable Unicode Standard itself.

How to use this material

This material is organized around a set of presentation slides which can be viewed in several ways. Each view is identified by an icon as described below.

All in one: A single page containing all explanatory text followed by small accompanying slides.

Slide by slide: One page per slide. This is particularly useful if you need to see the detail on a slide.

Slide text: This page-by-page version of the slides is provided mainly for those who want to cut and paste the text on the slides. (You will need appropriate fonts and rendering software to see the text correctly.)

Please send any comments to ishida@w3.org.

Scripts addressed and Reference examples

We will organize the material in the tutorial by concept, rather than by script. To help you, the script or scripts to which the concept applies will always be listed at the top right of the slide.

The list of scripts includes:

The tutorial covers most of the key features of each of these scripts.

There is a set of web pages with sample text in each of the scripts we will address. Each of the sample pages is a translation of the same English text. We will use these samples to illustrate as many of the points made as possible. That way you will be able to experiment with the examples yourself. In fact, where I have taken an example from a sample page I have typically included the text of that sample on the slide to help you locate real instances more easily.

If you use these examples for your own material, please ensure that you cite this paper and the web site as a source reference.

Large character sets

CJK character sets

Chinese

Initially there was only one type of Chinese – what we now call Traditional Chinese. Then in the 1950s Mainland China introduced a Simplified Chinese. It was simplified in two ways:

  1. the more common character shapes were reduced in complexity,

  2. a relatively smaller set of characters was defined for common usage than had traditionally been the case (resulting in the mapping of more than one character in Traditional Chinese to a single character in the Simplified Chinese set).

This slide shows Traditional Chinese above and Simplified Chinese below.

Traditional Chinese is still used to write characters in Taiwan and Hong Kong, and much of the Chinese diaspora. Simplified Chinese is used in Mainland China and Singapore. It is important to stress that people speaking many different, often mutually unintelligible, Chinese dialects would use one or other of these scripts to write Chinese – ie. the characters do not necessarily represent the sounds.

There are a few local characters, such as for Cantonese in Hong Kong, that are not in widespread use. In Chinese these ideographs are called hanzi. They are often referred to as Han characters.

There is another script used with Traditional Chinese for annotations and transliteration during input. It is called zhuyin or bopomofo, and will be described in more detail later.

It is said that Chinese people typically use around 3-4,000 characters for most communication, but a reasonable word processor would need to support at least 10,000. Unicode supports over 70,000 Han characters.

This slide shows examples of contrasting shapes in Traditional and Simplified ideographs.

The characters on the left are one ideograph; the characters on the right are another. Characters at the top are Traditional shapes; characters at the bottom are Simplified.

Note that each of the large glyphs shown above is a separate code point in Unicode. The Simplified and Traditional shapes are not unified unless they are extremely similar. (Han unification will be explained in more detail later.)

Japanese

Japanese uses three native scripts in addition to Latin (called romaji), and mixes them all together.

Top right on the slide is an example of ideographic characters, borrowed from Chinese, which in Japanese are called kanji. Kanji characters are used principally for the roots of words.

The example at the top left of the slide is written entirely in hiragana. Hiragana is a native Japanese syllabic script typically used for many indigenous Japanese words (as in this case) and for grammatical particles and endings. The example at the bottom of the slide shows its use to express grammatical information alongside a kanji character (the darker, initial character) that expresses the root meaning of the word.

Japanese everyday usage requires around 2,000 kanji characters – although Japanese character sets include many thousands more.

The example at the bottom of this slide shows the katakana script. This is used for foreign loan words in Japanese. The example reads ‘te-ki-su-to’, ie. ‘text’.

On this slide we see the more common characters from the hiragana (left) and katakana (right) syllabaries arranged in traditional order. A character in the same location in each table is pronounced exactly the same.
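
Because the two syllabaries run in parallel, Unicode encodes them in parallel too: each katakana letter sits 0x60 code points above its hiragana counterpart. The following is a minimal sketch of that relationship, not a complete kana converter; the function name `hiragana_to_katakana` is made up for illustration.

```python
# Offset between the hiragana block and the katakana block (U+3041 -> U+30A1).
KANA_OFFSET = 0x60

def hiragana_to_katakana(text: str) -> str:
    """Map each hiragana letter to the katakana letter in the same table slot."""
    out = []
    for ch in text:
        if "\u3041" <= ch <= "\u3096":        # hiragana small a .. small ke
            out.append(chr(ord(ch) + KANA_OFFSET))
        else:
            out.append(ch)                    # leave everything else untouched
    return "".join(out)

print(hiragana_to_katakana("てきすと"))  # テキスト
```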

With the exception of the vowels on the top line and the letter ‘n’, all of the symbols represent a consonant followed by a vowel.

Voiced consonants are indicated by attaching a dakuten mark to the unvoiced shape. The ‘p’ sound is indicated by the use of a han-dakuten (compare glyphs for ‘ha’, ‘ba’, and ‘pa’).
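
The relationship between ‘ha’, ‘ba’ and ‘pa’ is also visible in the character data. As a small illustration using Python's standard `unicodedata` module, canonical decomposition (NFD) splits the voiced forms into the unvoiced base plus a combining dakuten or han-dakuten:

```python
import unicodedata

def decompose(ch: str) -> list[str]:
    """Return the code points of the canonical decomposition of ch."""
    return [f"U+{ord(p):04X}" for p in unicodedata.normalize("NFD", ch)]

for ch in "はばぱ":
    print(ch, decompose(ch))
# は ['U+306F']
# ば ['U+306F', 'U+3099']   (base + combining dakuten)
# ぱ ['U+306F', 'U+309A']   (base + combining han-dakuten)
```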

A small ‘tsu’ (っ) is commonly used to lengthen a consonant sound.

Small versions of や, ゆ, and よ are used to form syllables such as ‘kya’ (きゃ), ‘kyu’ (きゅ), and ‘kyo’ (きょ) respectively.

When writing katakana the mark ー is used to indicate a lengthened vowel.

The example at the top of the slide shows the small tsu being used in katakana to lengthen the ‘t’ sound that follows it. This can be transcribed as ‘intanetto’.

The bottom example shows usage of other small versions of katakana characters. The transcription is ‘konpyuutingu’. In the first case the small ‘yu’ combines with the preceding ‘pi’ to produce ‘pyu’. In the second case the small ‘i’ is used with the preceding ‘te’ syllable to produce ‘ti’ – a sound that is not native to Japanese. (Their equivalent would be ‘chi’.)

The bottom example also shows the use of the han-dakuten and dakuten to turn ‘hi’ into ‘pi’ and ‘ku’ into ‘gu’.

There is also a lengthening mark that lengthens the ‘u’ sound before it.

Korean

Korean uses a unique script called hangul. It is unique in that, although it is a syllabic script, the individual phonemes within a syllable are represented by individual shapes. The example shows how the word ‘ta-kuk-o’ is composed of 7 jamos, each expressing a single phoneme. The jamos are displayed as part of a two dimensional syllabic character.

Note that the initial jamo in the last syllable is not pronounced in initial position and serves purely to conform to the rule that hangul syllables always begin with a consonant.

It is possible to store hangul text as either jamos or syllabic characters in Unicode, although the latter is more common. Unicode enables both approaches.
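
The two storage forms can be converted into one another. The sketch below uses Python's `unicodedata` module to split a syllabic character into jamos and back, and reproduces the arithmetic Unicode uses to lay out the syllable block. The sample syllable 한 is chosen here for illustration and is not taken from the slide.

```python
import unicodedata

# Constants from the Unicode Hangul syllable composition algorithm.
S_BASE = 0xAC00                          # first precomposed syllable
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # leading, vowel, trailing (incl. none)

def compose_syllable(l_index: int, v_index: int, t_index: int = 0) -> str:
    """Build a precomposed syllable from jamo indices."""
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

syllable = "한"
jamos = unicodedata.normalize("NFD", syllable)        # syllable -> jamos
print([f"U+{ord(j):04X}" for j in jamos])             # ['U+1112', 'U+1161', 'U+11AB']
print(unicodedata.normalize("NFC", jamos) == syllable)  # True
print(L_COUNT * V_COUNT * T_COUNT)                    # 11172 possible syllables
```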

South Korea also mixes ideographic characters borrowed from Chinese with hangul, though on nothing like the scale of Japanese. In fact, it is quite normal to find whole documents without any hanja, as the ideographic characters in Korean are called.

There are about 2,300 hangul characters in everyday use, but the Unicode Standard has code points for around 11,000.

Visual characteristics

Note how, because all the characters above are monospaced and fit within the same-sized box, the text on the slide gives the appearance of a grid. Grid layouts are actually a common typographic convention in East Asian scripts.

When half-width or proportionally-spaced characters are introduced, there is a possibility of this grid being corrupted, but typographic devices are available to provide several possible solutions to this.

You can experiment with various types of grid setting using CSS on the following web pages:

Han and kana characters are usually full-width, whereas Latin text is half-width or proportionally spaced.

Half-width katakana characters do exist, and for compatibility reasons there is a Unicode block for half-width kana characters. These codes should not normally be used, however. They arise from the early computing days when Japanese had to be fitted into a Western-biased technology.

Similarly, it is common to find full-width Latin text, especially in tables. Again, there is a Unicode block dedicated to full-width Latin characters and punctuation, but a font should be used instead.
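
To see how these compatibility characters relate to the ordinary ones, the following sketch prints the East Asian Width property of a few sample characters and the result of compatibility normalisation (NFKC), which folds the full-width Latin and half-width katakana forms back onto their regular code points. The sample characters are chosen here for illustration.

```python
import unicodedata

samples = ["A", "\uFF21", "\u30AB", "\uFF76", "\u6F22"]  # A, Ａ, カ, ｶ, 漢

for ch in samples:
    width = unicodedata.east_asian_width(ch)   # Na, F, W, H, W
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} width={width} NFKC=U+{ord(folded):04X}")
```

Here `F` means full-width, `H` half-width, `W` wide and `Na` narrow. Whether to apply NFKC to real data is a design decision, since it discards the width distinction.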

Radicals

A radical is an ideograph or a component of an ideograph that is used for indexing dictionaries and word lists, and as the basis for creating new ideographs. The 214 radicals of the KangXi dictionary are universally recognised.

The examples enlarged on the slide show the ideographic character meaning ‘word’, ‘say’ or ‘speak’ (bottom left), and three more characters that use this as a radical on their left hand side.

The visual appearance of radicals may vary significantly.

Here the radical shown on the previous slide is seen as used in Simplified Chinese (top right). Although the shape differs somewhat it still represents the same radical.

On the bottom row we see the ‘water’ radical being used in two different positions in a character, and with two different shapes. This time the right-most example is found in both simplified and traditional forms.

Unicode dedicates two blocks to radicals. The KangXi radicals block depicted here contains the base forms of the 214 radicals.

The CJK Radicals Supplement contains variant shapes of these radicals when they are used as parts of other characters or in simplified form. These have not been unified because they often appear independently in dictionary indices.

Characters in these blocks should never be used as ideographs.
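
As an illustration of why the distinction matters, the ‘speech’ radical and the ideograph meaning ‘word’ look alike but are different code points, so a plain string comparison will not match them. The sketch below assumes Python's `unicodedata` module; it shows that only compatibility normalisation (NFKC) maps the radical onto the ideograph.

```python
import unicodedata

radical = "\u2F94"     # KANGXI RADICAL SPEECH
ideograph = "\u8A00"   # CJK unified ideograph 言

print(unicodedata.name(radical))                # KANGXI RADICAL SPEECH
print(radical == ideograph)                     # False: different code points
print(unicodedata.decomposition(radical))       # <compat> 8A00
print(unicodedata.normalize("NFKC", radical) == ideograph)  # True
```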

Implementing multi-byte characters

In the early days of computing a byte consisted of 7 bits, allowing for a code page containing 128 code points. This was the day of ASCII.

When bytes contained 8 bits they gave rise to code pages containing 256 code points. These code pages typically retain the ASCII characters in the lower 128 code points and add characters for additional languages to the upper reaches. On the slide we see a ‘Latin1’ code page, ISO 8859-1, containing code points for Western European languages.

Unfortunately, 256 code points was not enough even to support the whole of Europe – not even Latin based languages such as Turkish, Hungarian, etc. To support Greek characters you might see the code points re-mapped as shown on the slide (left hand side). These alternative code pages forced you to maintain contextual information so that you could determine the intended character from the upper ranges of the code page. It also made localization difficult since you had to keep changing code pages.
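
The need for contextual information is easy to demonstrate. In this small sketch the same byte value is decoded with two of Python's built-in codecs, ISO 8859-1 and ISO 8859-7 (the Greek code page); the byte 0xE1 is simply a convenient sample.

```python
raw = bytes([0xE1])   # one byte from the upper half of the code page

as_latin1 = raw.decode("latin-1")     # Western European reading
as_greek = raw.decode("iso8859_7")    # Greek reading of the same byte

print(as_latin1)  # á
print(as_greek)   # α
```

Nothing in the byte itself says which reading is intended; that information has to travel alongside the data.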

East Asian computing immediately faced a much bigger problem than in Europe, as can be seen by the size of these common character sets. They resorted to double-byte coded character sets. Two-byte character sets provided 16 bits, and would allow for 2^16 (ie. 65,536) possible code points. In reality these character sets tended to be based on a 7-bit model, utilizing only a part of the total space available.

One particular problem persisted here – these character sets and their encodings were script specific. It was still difficult to represent Chinese, Korean and Japanese text simultaneously.
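
A rough way to see this script-specific behaviour is to encode the same characters with a few legacy codecs that ship with Python. The choice of codecs and sample characters below is purely illustrative.

```python
LEGACY_CODECS = ("gb2312", "big5", "shift_jis")

def can_encode(ch: str, codec: str) -> bool:
    """Report whether the codec has any byte sequence for ch."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# The same ideograph gets a different 2-byte value in every character set.
for codec in LEGACY_CODECS:
    print(codec, "中".encode(codec).hex())

# A hangul syllable cannot be represented in any of the three.
print([can_encode("한", codec) for codec in LEGACY_CODECS])  # [False, False, False]
```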

Unicode and ISO 10646 from the start enabled the use of the full range of 16-bit code points for their Basic Multilingual Plane (BMP). This meant that all of the above scripts and more could be represented simultaneously with ease. Localization also became easier, since there was no need to enable new code pages or switch encodings – you simply began using the characters in the appropriate part of the BMP.

Recently Unicode and ISO 10646 have defined 16 supplementary planes, each the same size as the BMP, for future expansion. Some of those planes are being populated already. There are code points defined for additional alphabets and a large number of math characters in the Supplementary Multilingual Plane (SMP). Also a large number of additional ideographic characters have been added to the Supplementary Ideographic Plane (SIP).

In total there are now over one million code points available.
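
The arithmetic behind that figure is 17 planes (the BMP plus 16 supplementary planes) of 65,536 code points each. A short sketch, with the helper name `plane` invented for illustration:

```python
import sys

PLANE_SIZE = 0x10000          # 65,536 code points per plane
PLANE_COUNT = 17              # BMP + 16 supplementary planes

def plane(ch: str) -> int:
    """Return the plane number of a character (0 = BMP, 1 = SMP, 2 = SIP)."""
    return ord(ch) >> 16

print(PLANE_COUNT * PLANE_SIZE)       # 1114112
print(sys.maxunicode == 0x10FFFF)     # True: the last code point
print(plane("a"), plane("\U0001D400"), plane("\U00020000"))  # 0 1 2
```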

Unification

Unicode provides a superset of most character sets in use around the world, but tries not to duplicate characters unnecessarily. For example, there are several ISO character sets in the 8859 range that all duplicate the ASCII characters. Unicode doesn't have as many codes for the letter 'a' as there are character sets - that would make for a huge and confusing character set.

The same principle applies for Han (Chinese) characters. The initial set of sources for Han encoding in Unicode laid end to end comprised 121,000 characters, but there were many repeats, and the final Unicode tally for all these after elimination of duplicates was 20,902. (There are now over 70,000 Han characters encoded in Unicode.)

If Han characters had different meanings or etymologies, they were not unified. Han characters, however, are highly pictorial in nature. So the (dis-) unification process had to take into account the visual forms to some extent. Where there was a significant visual difference between han characters that represented the same thing they were allotted to separate Unicode code points. (Unifying the Han characters is a sophisticated process, carried out over a long period by many East-Asian experts.)

Factors such as those shown on this slide prevent unification, ie.

What is left for unification are characters representing the same thing but exhibiting no visual differences, or relatively minor differences such as different sequence for writing strokes, differences in stroke overshoot and protrusion, differences in contact and bend of strokes, differences in accent and termination of strokes, etc.

Encoding methods

Although the terms character set and character encoding are often treated as the same thing, we will use them to mean separate things in this tutorial.

A character set or repertoire comprises the set of atomic text elements you will use for a particular purpose – be it the repertoire of characters required to support Western European languages in computers, or the repertoire of characters a Chinese child will learn at school this year (nothing to do with computers).

The character encoding reflects the way these abstract characters are mapped to numbers for manipulation in a computer.

In a standard such as ISO 8859 encodings tend to use a single byte for a given character and the encoding is straightforwardly related to the position of the characters in the set.

The above distinction becomes helpful when discussing Unicode because the set of characters (ie. the character set) defined by the Unicode Standard can be encoded in a number of different ways. The type of encoding doesn’t change the number or nature of the characters in the Unicode set, just the way they are mapped into numbers for manipulation by the computer (see the next slide).

Similarly, on the Web, the document character set of an XML or HTML document is always Unicode. A particular XML or HTML document, however, can be encoded using any encoding, even encodings that don’t cover the full Unicode range such as ISO 8859-1 (Latin1). However, because the document character set is Unicode, even if a Web page uses Latin1 as its encoding, it can use special constructs called numeric character references (eg. &#x1234;) to include any Unicode character outside that encoding.
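
The round trip can be sketched with Python's standard library: a character that Latin1 cannot encode is written out as a numeric character reference, and an HTML-aware reader turns the reference back into the character. The sample character U+1234 is arbitrary.

```python
import html

text = "\u1234"   # a character outside ISO 8859-1

# Encoding to Latin1 fails unless unencodable characters are escaped.
as_latin1 = text.encode("latin-1", errors="xmlcharrefreplace")
print(as_latin1)                                   # b'&#4660;'

# Decimal and hexadecimal references resolve to the same code point.
print(html.unescape("&#4660;") == text)            # True
print(html.unescape("&#x1234;") == text)           # True
```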

Character encodings are the things that have names in the IANA registry.

The Unicode Standard assigns a unique scalar number to every character in its character set. The resulting numbered set is referred to as a coded character set. Units of a coded character set are known as code points.

In Unicode there are a number of ways of encoding the same character. These include UTF-8, UTF-16, and UTF-32.

UTF-8 uses 1 byte to represent characters in the old ASCII set, two bytes for characters in several more alphabetic blocks, and three bytes for the rest of the BMP. Supplementary characters use 4 bytes.

UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes for supplementary characters.

UTF-32 uses 4 bytes everywhere. In the chart on the slide, the first line of numbers represents the position of the characters in the Unicode coded character set. The other lines show the byte values used to represent that character in a particular character encoding.

This explanation glosses over some of the detailed nomenclature related to encoding. More detail can be found in Unicode Technical Report #17.
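
The byte counts described above can be checked directly. This sketch measures four sample characters (chosen here, not taken from the slide) in each encoding; the little-endian codec variants are used so that Python does not prepend a byte order mark to the result.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def encoded_sizes(ch: str) -> dict[str, int]:
    """Return the number of bytes ch occupies in each Unicode encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}

for ch in ["a", "\u05D0", "\u4E2D", "\U00020000"]:   # ASCII, Hebrew, Han (BMP), Han (SIP)
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
# U+0061  utf-8: 1, utf-16: 2, utf-32: 4
# U+05D0  utf-8: 2, utf-16: 2, utf-32: 4
# U+4E2D  utf-8: 3, utf-16: 2, utf-32: 4
# U+20000 utf-8: 4, utf-16: 4, utf-32: 4
```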

Respecting character boundaries

You can see from the previous slide that nearly all Unicode characters are encoded using multiple bytes. In an encoding such as UTF-8 the number of bytes actually used depends on the character in question. This means that care has to be taken to recognize and respect the integrity of the character boundaries. Applications cannot simply handle a fixed number of bytes when performing editing operations such as inserting, deleting, wrapping, cursor positioning, etc. Collation for searching and sorting, pointing into strings, and all other operations similarly need to work out where the boundaries of the characters lie in order to successfully process the text.

An ASCII character such as 'a' will be represented by a single byte.

Hebrew characters are represented in UTF-8 using two bytes.

Ideographic characters in the Basic Multilingual Plane are three bytes in UTF-8. The moral of this story is, don’t use bytes, use characters.

This slide illustrates how things go wrong with old technology that is not multi-byte aware. In this case the author attempted to delete a Chinese character, and the application translated that to mean delete a single byte. This causes a misalignment of all the following bytes, and produces garbage.
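
The same kind of failure can be reproduced in a few lines. The sketch below assumes UTF-8, which is not necessarily the encoding used on the slide, and compares a byte-oriented ‘backspace’ with a character-oriented one. (UTF-8 confines the damage to the affected character, but the result is still invalid text.)

```python
text = "中文"
data = text.encode("utf-8")          # 6 bytes, 3 per character

# Wrong: treat "delete the last character" as "delete the last byte".
broken = data[:-1]
try:
    broken.decode("utf-8")
    byte_delete_ok = True
except UnicodeDecodeError:
    byte_delete_ok = False           # the trailing character is now truncated

# Right: delete on character boundaries, then encode.
fixed = text[:-1].encode("utf-8")

print(byte_delete_ok)                # False
print(fixed.decode("utf-8"))         # 中
```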

Inputting ideographic characters

Getting to the right character quickly

We have noted that East Asian character sets number their characters in the thousands. So how do you quickly find the one character you want while typing?

In the past people have tried using extremely large keyboards, or forcing people to remember the code point numbers for the character. Not surprisingly these approaches were not very popular.

The answer is to use an IME (Input Method Editor). An IME (also called a front-end processor) is software that uses a number of strategies to help you search for the character you want.

This slide summarizes the typical steps when typing in Japanese using a standard IME for Windows.

The user types Japanese in romaji transcription using a QWERTY keyboard. As they type the transcription is automatically converted to hiragana or katakana. Ranges of characters are accepted by a key press as they go along. To convert a range of characters to kanji, the user presses a key such as the space bar. Typically the IME will automatically insert into the text the kanji that were last selected for the transcription that has been input. If this is not the desired kanji sequence, the user presses the key again and a selection list pops up, usually ordered in terms of frequency of selection. The user picks the kanji characters required, and confirms their choice, then moves on.

Note that there are only a few alternatives for the sequence かいぎ. If the user had looked up かい and ぎ separately they would have been faced each time with a large number of choices. The provision of a dictionary as part of the IME for lookup of longer phrases is one way of speeding up the process of text entry for the user.

Ordering by frequency and memory of the last conversion are additional methods of assisting the user to find the right character more quickly.
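
The candidate-ranking behaviour described above can be sketched as a toy model. Everything in it is hypothetical: the class `ToyIME`, its tiny dictionary and the frequency numbers are invented for illustration and bear no relation to how any real IME stores its data.

```python
class ToyIME:
    """Toy model: rank kanji candidates by frequency, remembering the last pick."""

    def __init__(self, dictionary: dict[str, list[tuple[str, int]]]):
        self.dictionary = dictionary          # reading -> [(candidate, frequency)]
        self.last_choice: dict[str, str] = {}

    def candidates(self, reading: str) -> list[str]:
        ranked = sorted(self.dictionary.get(reading, []), key=lambda item: -item[1])
        words = [word for word, _ in ranked]
        last = self.last_choice.get(reading)
        if last in words:                     # the last selection is offered first
            words.remove(last)
            words.insert(0, last)
        return words

    def select(self, reading: str, word: str) -> None:
        self.last_choice[reading] = word

# Invented frequencies for two words that share the reading かいぎ.
ime = ToyIME({"かいぎ": [("懐疑", 10), ("会議", 90)]})
first = ime.candidates("かいぎ")     # ordered by frequency
ime.select("かいぎ", "懐疑")
second = ime.candidates("かいぎ")    # last selection now comes first
print(first, second)
```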

Chinese input methods

Whereas the Japanese romaji input method predominates for Japanese, there are a number of different approaches available for Chinese.

Pinyin was introduced with Simplified Chinese, and is typically used in the same geographical areas, ie. Mainland China and Singapore.

It is essentially equivalent to the romaji input method. The numbers you see in the example above indicate tones. This dramatically reduces the ambiguity of the sounds in Chinese.

One of the problems of pinyin is that the transcription is based on the Mandarin or Putonghua dialect of spoken Chinese. So to use this method you need to be able to speak that dialect.

A more common input method in Taiwan uses an alphabet called zhuyin or bopomofo. This alphabet is only used for phonetic transcription of Chinese. Essentially it is the same idea as pinyin, but with different letters. The tones in this case are indicated by spacing accent marks (shown only in the top line on the slide) which in Unicode are unified with accents used in European languages. An appropriate font will however display these as tone marks rather than accents in this context.

A very different approach allows the user to create the desired character on the basis of its visual appearance rather than the underlying phonics.

Changjie input uses just such an approach. The keyboard provides access to primitive ideographic components which, when combined in the right sequence, lead to the desired ideograph.

An advantage of an approach such as changjie is that you don’t have to speak Mandarin. A drawback is the additional training required.

Note that pen-based input is another useful approach. (In fact, this is particularly helpful for people who do not speak Chinese or Japanese. Once you master a few simple rules about stroke order and direction, you can use something like Microsoft’s IME Pad to draw and select characters without any knowledge of components or pronunciation.)

The examples on this slide show the keystrokes required to enter the text used in the previous slides containing pinyin and bopomofo examples.

Alternative representations of characters

In some cases you may come across an ideograph that your font or your character set doesn’t support. Unicode provides a way of saying, “I can’t represent it, but it looks like this character.”

The approach requires you to add character U+303E immediately followed by a similar looking character. This at least gives the reader a chance to guess at the character that is missing.

Another way of addressing the same problem is to use the ideographic description characters introduced in Unicode 3.0.

This approach allows you to draw a picture showing the various components of the character you can’t represent, and where they appear. The lower line on the slide shows how you would describe the large character near the top. Note that this is interpreted recursively.

Note also, that this should not be treated as in any way equivalent to an existing ideograph when collating strings.
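
To make the recursive interpretation concrete, here is a minimal sketch of a parser for such description sequences. It assumes the twelve description characters U+2FF0 to U+2FFB, of which U+2FF2 and U+2FF3 take three components and the rest take two; the function name `parse_ids` and the sample sequences are chosen here and are not those on the slide.

```python
TERNARY = {"\u2FF2", "\u2FF3"}                                 # three components
BINARY = {chr(c) for c in range(0x2FF0, 0x2FFC)} - TERNARY     # two components

def parse_ids(seq: str, pos: int = 0):
    """Parse one description starting at pos; return (tree, next position)."""
    ch = seq[pos]
    if ch in BINARY:
        arity = 2
    elif ch in TERNARY:
        arity = 3
    else:
        return ch, pos + 1                 # a plain component ends the recursion
    pos += 1
    parts = []
    for _ in range(arity):                 # each operand may itself be a description
        part, pos = parse_ids(seq, pos)
        parts.append(part)
    return (ch, *parts), pos

flat, _ = parse_ids("⿰言吾")               # left-to-right: 言 beside 吾
nested, _ = parse_ids("⿱艹⿰日月")          # 艹 above (日 beside 月)
print(flat)     # ('⿰', '言', '吾')
print(nested)   # ('⿱', '艹', ('⿰', '日', '月'))
```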

Summary

Complex script rendering

Definitions

Before getting into this section it is important to draw attention to the difference between characters and glyphs.

A character is a semantic unit representing an indivisible unit of text in memory.

A glyph is the visual representation of a character or sequence of characters.

The example on the slide shows two glyphs for the single character a, and two glyphs for a single Han character. This distinction will become very important in this section. For more information about the distinction between characters and glyphs, see Unicode Technical Report #17.

A font, by the way, is a collection of glyphs.

Combining characters

Arabic & Hebrew short vowels

Arabic and Hebrew scripts usually do not represent short vowel sounds. The languages are so heavily pattern based that readers can adequately guess at the pronunciation of the words.

In circumstances where ambiguity appears, such as the name of the German town Mainz in the example on the slide, short vowels are represented as diacritics attached to the base consonants.

Here, for example, the slide shows the Arabic word for engineer, pronounced ‘muhandis’.

It is actually written, ‘mhnds’.

If needed, the short vowels (there are only 3 in Arabic) are represented as shown on the lower line of the slide. Note that the small circle diacritic indicates NO intervening vowel. (Sequences of code points in Arabic and Hebrew on this and following slides will be shown in left to right order, to emphasise that the underlying order is logical.)

These short vowels are separate combining characters in the text stream that are displayed in the same two-dimensional visual space with the base character. Combining characters do not generally appear without a base character.
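
In the text stream this looks as follows. The sketch spells ‘muhandis’ with its short vowels as a sequence of base letters and combining marks, then removes the marks to recover the usual unvowelled spelling. The vowelled code point sequence is written out here as an assumption about what the slide shows.

```python
import unicodedata

# m + damma, h + fatha, n + sukun, d + kasra, s  (logical order)
vowelled = "\u0645\u064F\u0647\u064E\u0646\u0652\u062F\u0650\u0633"

def strip_marks(text: str) -> str:
    """Drop non-spacing combining marks, keeping only the base characters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

for ch in vowelled:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

unvowelled = strip_marks(vowelled)
print(len(vowelled), len(unvowelled))   # 9 5
```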

Context-sensitive placement of diacritics

When displaying combining characters, care has to be given to appropriate positioning. In the Thai example on the slide, the same character code is used to represent both the tone mark glyphs that are circled. There are not two different characters based on the desired visual position. The font has to work out the best position for the glyph according to the run-time visual context.

This slide provides another example of context-sensitive positioning of combining characters.

The short vowel ‘i’ in Arabic is usually drawn below the base character. This is normally the only way of distinguishing it from the short vowel ‘a’, which is displayed above the base character.

In this example, however, an additional shadda diacritic is introduced. The shadda is used to lengthen the consonant it is attached to. In that context it is common (though not mandatory) for the ‘i’ vowel diacritic to appear above the base character, but below the shadda so you can still tell it apart from ‘a’.

Note also that this example introduces the idea that you can have more than one combining character associated with a base character.

Vowel signs

In Indic scripts and scripts derived from them a consonant character carries with it an inherent vowel. The character on the top line on the slide is transcribed ‘ka’, not just ‘k’.

If you want to follow the ‘k’ sound with a different vowel, you append a vowel sign to the consonant character. This vowel sign overrides the inherent vowel with a different sound.

In Indic scripts vowel signs are all combining characters. Unlike the Arabic and Hebrew short vowels, however, some of these combining characters may also take up additional space on a line (see the example ‘kii’ on the slide). They are referred to as spacing combining characters.
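The distinction between non-spacing and spacing combining characters is visible in the Unicode general category. The snippet below is a small sketch using Python's standard `unicodedata` module; the three characters were picked for illustration and are not necessarily those on the slide.

```python
import unicodedata

signs = {
    "Arabic kasra (short i)": "\u0650",
    "Devanagari vowel sign U": "\u0941",
    "Devanagari vowel sign II": "\u0940",
}

# Mn = non-spacing mark, Mc = spacing combining mark.
categories = {label: unicodedata.category(ch) for label, ch in signs.items()}

for label, cat in categories.items():
    print(f"{label}: {cat}")
```

The Arabic short vowel and the Devanagari vowel sign U are non-spacing marks, whereas the Devanagari vowel sign II takes up its own space on the line.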

Thai, being derived from Indic scripts, also has vowel signs, although they are used in a slightly more complex way.

In the example on this slide, three vowel signs surround the consonant to produce the desired effect.

Whereas in the Indic scripts all vowel signs are combining characters, only one of the vowel signs in this example is combining. The other two (indicated by arrows) are normal spacing characters. This is a distinction introduced to Unicode at the request of the Thai national standards body.
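The same property check makes the Thai distinction visible. This is only a sketch with Python's `unicodedata` module, and the three vowel signs below are chosen as assumed examples rather than taken from the slide.

```python
import unicodedata

thai_vowels = ["\u0e40", "\u0e34", "\u0e32"]  # SARA E, SARA I, SARA AA

# Lo = ordinary letter (a normal spacing character), Mn = non-spacing mark.
thai_categories = {
    unicodedata.name(ch): unicodedata.category(ch) for ch in thai_vowels
}

for name, cat in thai_categories.items():
    print(name, cat)
```

SARA I is a combining mark, while SARA E and SARA AA are encoded as ordinary spacing letters.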

Coding combining characters

When it comes to implementing combining characters, an important question to ask is what order should be applied to them and the base character. Unless you have agreement on this, you can have serious problems when passing data between systems.

The Unicode Standard requires that all combining characters follow the base character in a Unicode string. (So the example to the left on the slide is correct.)

Each combining character has a combining class property expressed as a numeric value. Combining characters that appear in the same location relative to the base character when displayed will typically share the same combining class. For example acute, grave and circumflex accents all appear above the base character and all share the same combining class.
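The combining class can be inspected directly. A minimal sketch, assuming Python's standard `unicodedata` module, where `combining()` returns the canonical combining class as an integer:

```python
import unicodedata

marks = {
    "grave": "\u0300",
    "acute": "\u0301",
    "circumflex": "\u0302",
    "dot below": "\u0323",
    "cedilla": "\u0327",
}

# Marks drawn in the same position share a class; others differ.
classes = {label: unicodedata.combining(ch) for label, ch in marks.items()}

for label, ccc in classes.items():
    print(f"{label}: {ccc}")
```

The three accents that sit above the base share one class, while the dot below and the cedilla each have a different one.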

Multiple combining characters attached to the same base do not have to be in any particular order unless the text is in one of the Unicode normalization forms. The standard requires that two sequences of the same combining characters be treated as equivalent, whatever their order, as long as the characters all have different combining classes.

Unicode normalization, however, applies a canonical ordering to multiple combining characters.

If characters have the same combining class they are likely to interact typographically to produce different possible results, as in the case above. In this case the ‘inside-out’ rule is applied. This rule states that the proximity of the combining character in the text stream must match the visual proximity.
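The effect of canonical ordering can be sketched as follows, assuming Python's `unicodedata.normalize`. The particular accents are chosen for illustration only.

```python
import unicodedata

def nfd(s):
    return unicodedata.normalize("NFD", s)

# Different combining classes: acute (230) and dot below (220).
# Either order is equivalent, and normalization puts them in one order.
above_then_below = "a\u0301\u0323"
below_then_above = "a\u0323\u0301"
different_classes_equal = nfd(above_then_below) == nfd(below_then_above)

# Same combining class: acute and diaeresis are both 230.
# Their order is significant, so normalization leaves it alone.
acute_then_diaeresis = "a\u0301\u0308"
diaeresis_then_acute = "a\u0308\u0301"
same_class_equal = nfd(acute_then_diaeresis) == nfd(diaeresis_then_acute)

print(different_classes_equal, same_class_equal)
```

In the second pair the mark that comes first in the text stream is the one drawn closest to the base, which is why the two sequences are not interchangeable.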

Precomposed vs. decomposed

There are many precomposed characters in Unicode that have an accent or diacritic already combined with a base character (such as a-acute above). It is however also possible to represent this character using a simple ‘a’ followed by a combining acute accent. This is referred to as a decomposed character sequence.

The Unicode Standard states that both of these approaches must be considered canonically equivalent.
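In practice, canonically equivalent strings are still different sequences of code points, so a plain comparison will not treat them as equal. A small sketch using Python's `unicodedata` module:

```python
import unicodedata

precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # 'a' followed by COMBINING ACUTE ACCENT

# A naive comparison sees two different strings.
raw_equal = precomposed == decomposed

# After normalizing both sides to the same form they compare equal.
normalized_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(raw_equal, normalized_equal)
```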

Normalization

To facilitate string comparison in operations such as searching and sorting, it is helpful to adopt a standard policy with regard to precomposed versus decomposed variants of a character sequence, and the order in which multiple combining characters appear. This can be achieved by applying an appropriate normalization form. The Unicode Standard provides a normalization form called NFD that represents all character sequences in maximally decomposed form. In addition to decomposition, NFD applies a standard order to multiple combining characters attached to a base character. As an alternative, the Unicode Standard offers NFC. NFC is achieved by applying NFD to the text, then re-composing characters for which precomposed forms exist in version 3.0 of the standard.

Note that there are actually some precomposed forms in the Unicode character set that are not generated by NFC, for reasons we will not go into here. In addition, where there is no precomposed form, a character sequence is left decomposed, but canonical ordering is still applied to all combining characters.

The Unicode Standard also offers two more normalization forms, NFKD and NFKC, where K stands for ‘kompatibility’. These forms are provided because the Unicode character set includes many characters merely to provide round-trip compatibility with other character sets. Such characters represent such things as glyph variants, shaped forms, alternative compositions, and so on, but can be represented by other ‘canonical’ versions of the character or characters already in Unicode. Ideally, such compatibility variants should not be used. The NFKD and NFKC normalization forms replace them with more appropriate characters or character sequences. (This, of course, can cause a problem if you intend to convert data back into its original encoding, because you lose the original information.)
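The following sketch, again assuming Python's `unicodedata.normalize`, illustrates two of the points above: a precomposed character that NFC does not regenerate, and a compatibility character that only the K forms replace. The two characters are chosen as examples and do not come from the slides.

```python
import unicodedata

# DEVANAGARI LETTER QA is a precomposed form that NFC does not produce:
# it is decomposed and then left decomposed.
qa = "\u0958"
qa_nfc = unicodedata.normalize("NFC", qa)

# LATIN SMALL LIGATURE FI is a compatibility character.
fi_ligature = "\ufb01"
fi_nfc = unicodedata.normalize("NFC", fi_ligature)    # unchanged
fi_nfkc = unicodedata.normalize("NFKC", fi_ligature)  # replaced by 'f' + 'i'

print([f"U+{ord(c):04X}" for c in qa_nfc], fi_nfc == fi_ligature, fi_nfkc)
```

Note that once the ligature has been replaced by two ordinary letters there is no way to tell from the text alone that a ligature character was originally used.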

Context-sensitive glyph shaping

Word final glyph variants

In Hebrew and Greek there are certain characters (only a small number) that look different in the middle of a word and at the end. Two examples are shown on the slide. In each example, the same consonant appears in the middle of a word and at the end of a word in the sample text, and has a different appearance.

Due to traditional approaches, these shapes are encoded separately and are typed in using distinct keys on the keyboard. This is manageable because there are so few such characters.

In other scripts a very different approach has to be taken.

Cursive script

Arabic is often referred to as a cursive script with the meaning that letters in a word are usually joined to each other – whether handwritten or printed.

The slide shows the unjoined form of the letter AIN at the top right, and, at the bottom, three joined examples of the same letter. As you can see, the shape changes quite dramatically.

This slide shows some more examples of un-joined Arabic letters (right column) and their various joining forms (to the left).

It is important to understand that there is only ONE code point here for each letter. The various different visual forms are only font-based glyphs chosen to suit the run-time visual context.

(There are compatibility characters encoded in Unicode for specific joining forms, but these should not be used for storing Arabic text edited in Unicode. They are only provided to allow round-trip conversions between Unicode and legacy character encodings. When text is normalized using the compatibility forms NFKC or NFKD, these are all mapped to the main Unicode Arabic block.)
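A quick way to see the relationship between a compatibility joining form and the regular letter is to look at its decomposition. This sketch uses Python's `unicodedata` module with the initial form of AIN as the example:

```python
import unicodedata

initial_ain = "\ufecb"  # compatibility character for the initial form of AIN

form_name = unicodedata.name(initial_ain)
mapping = unicodedata.decomposition(initial_ain)

# The canonical form NFC leaves the compatibility character untouched;
# the compatibility form NFKC maps it back to the ordinary letter AIN.
after_nfc = unicodedata.normalize("NFC", initial_ain)
after_nfkc = unicodedata.normalize("NFKC", initial_ain)

print(form_name, "|", mapping, "|", f"U+{ord(after_nfkc):04X}")
```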

The shapes on the slide can be referred to (from right to left) as independent, initial, medial and final.

Inputting cursive glyphs

On previous slides I mentioned the ‘run-time’ context. This is quite important. If I type in the Arabic letter HEH shown at the top of the slide it will initially be in an independent glyph form. If I press exactly the same key on the keyboard and insert exactly the same character alongside it in memory, however, the original letter HEH will be expected to join with the second HEH. The shape of the first HEH will therefore change to ‘initial’, and the second HEH will be in ‘final’ shape. Type another HEH and the second will become ‘medial’, and so on.

In this way Arabic text is constantly changing as you type. The editing application also has to adapt these glyphs as you do things such as backspace, insert or delete text.
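The way a form is chosen from the neighbouring characters can be sketched roughly as below. This is a deliberately simplified illustration, not a real shaping engine: the joining table is hand-written and hypothetical, covers only three letters, and ignores diacritics, ligatures and joiner controls.

```python
# Hypothetical, hand-written joining table (illustration only).
# "D" = joins on both sides, "R" = joins only to the preceding letter.
JOINING_TYPE = {
    "\u0647": "D",  # HEH
    "\u0628": "D",  # BEH
    "\u0627": "R",  # ALEF
}

def joining_forms(text):
    """Return the glyph form needed for each character of `text`."""
    forms = []
    for i, ch in enumerate(text):
        prev = text[i - 1] if i > 0 else None
        nxt = text[i + 1] if i + 1 < len(text) else None
        # A letter joins backwards only if the previous letter joins forwards.
        joins_prev = (
            prev is not None
            and JOINING_TYPE.get(prev) == "D"
            and JOINING_TYPE.get(ch) in ("D", "R")
        )
        joins_next = (
            nxt is not None
            and JOINING_TYPE.get(ch) == "D"
            and JOINING_TYPE.get(nxt) in ("D", "R")
        )
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("independent")
    return forms

HEH = "\u0647"
for count in (1, 2, 3):
    print(count, joining_forms(HEH * count))
```

Because the form of each letter depends on its neighbours, the whole run has to be re-evaluated every time a character is inserted or deleted.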

Conjunct consonants

When two Indic consonants appear together without any intervening vowel sound they may form a conjunct, ie. the consonant cluster is rendered as a composite shape. This composite shape may show a vertical or horizontal mixture of the base shapes. In some cases the original constituents of a conjunct may not be recognizable.

One approach that is very common is the use of a half-form to represent the initial consonant in the cluster. An example of this is shown on the bottom line of the slide.

It is important to bear in mind, once again, that this is all glyph magic. The individual consonants are all still represented using the regular code points in memory; it is only the visual appearance that changes. There are no special code points for half-form glyphs. The appropriate glyph is simply applied at display time according to the rendering rules of the script.

In actual fact, there is a vital ingredient to a conjunct form that we have not yet discussed. It is called a virama. The virama is often called ‘vowel killer’.

If you simply put two consonants side by side in Unicode, as in the top line on the slide, you will get two separate consonants displayed (with the assumption on the part of the reader that there is an inherent vowel between them).

It is only when you put a virama character between them that they combine to form a conjunct. So the conjunct glyph shown middle right actually represents three underlying characters.
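To make the three underlying characters concrete, here is a small Python sketch. The consonants KA and SSA are an assumed example and may differ from the pair shown on the slide.

```python
import unicodedata

# Assumed example: KA + VIRAMA + SSA, commonly rendered as one conjunct glyph.
conjunct = "\u0915\u094d\u0937"

code_point_names = [unicodedata.name(ch) for ch in conjunct]

for ch, name in zip(conjunct, code_point_names):
    print(f"U+{ord(ch):04X} {name}")
```

Whatever the font draws, the string length stays at three, and the virama itself is just a combining mark in the text stream.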

The number of conjunct forms can vary from font to font. Some fonts will be capable of rendering more than others. So what happens if the font you are using doesn’t have a conjunct glyph for the combination you want to create?

In this case the virama is shown visually as a combining mark – see the last line on the slide. (In fact, in modern Tamil this is the default approach.)

More character to glyph rendering

Special joining forms

The concepts we have discussed so far in this section on combining characters and glyph shaping have shown that there is no one-to-one correspondence, as there usually is in English, between the characters in memory and the glyphs displayed on screen. Indeed, sometimes complex rules are needed to determine the displayed result.

We have seen some of the more basic transformation cases, but over the next few slides we will take a quick look at some additional possibilities. This is by no means intended to give you all the information you need to implement these scripts – merely expose you to some slightly more advanced behavior.

First, we look at some font-dependent alternatives for joining Arabic glyphs. Arabic glyphs typically join along the baseline, but in some (typically more classical) fonts, specific pairings join above the baseline as shown in the top left example on the slide.

The use of half-forms in Indic scripts could also be seen as a kind of special joining form.

Positioning variation

Spacing combining characters to the left of the base consonant are common in Indic scripts. Here what is important to bear in mind is that the Unicode rule about combining characters following the base character still applies. It is only as part of the rendering process that the glyph for the combining character is made to appear to the left.

The example on this slide shows how the Hindi word for ‘Hindi’ would normally be displayed, but on the second line shows the order of the characters in memory.
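The stored order can be listed with a few lines of Python. This sketch simply prints the code points of the Hindi word for ‘Hindi’ in the order they are held in memory.

```python
import unicodedata

# The Hindi word for 'Hindi', written with escapes to make the order explicit.
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"

stored_order = [unicodedata.name(ch) for ch in hindi]

for position, name in enumerate(stored_order):
    print(position, name)
```

The vowel sign I is stored second, after the HA it belongs to, even though its glyph is drawn to the left of that consonant.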

The example text from the Thai sample shown on this slide illustrates the same effect in Thai. This word is pronounced very much like ‘program’, and the vowel sign at the far left is actually pronounced after the third character (ie. it is the ‘o’ sound after ‘pr’).

We have already seen, however, that vowel signs are not necessarily combining in Thai, so no reordering is actually needed in this case. The characters displayed are actually stored in the same order in memory.

This slide shows some additional examples of reordering during display.

The top example shows a Tamil combining character that appears on both sides of the base consonant when displayed.

The bottom example shows the Devanagari repha in a consonant cluster. The RA code that appears at the beginning of the cluster in memory is rendered as a diacritic above the vowel sign that completes the syllabic cluster.

Ligatures

Ligatures are very common. Essentially a ligature is a single glyph that represents more than one underlying character.

The example shown here is of a mandatory ligature in Arabic. A LAM character followed by an ALEF character must always be displayed as a single lam-alef glyph.
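The two underlying characters of this ligature, with LAM first and ALEF second in memory, can be seen by inspecting the compatibility character that Unicode encodes for the isolated lam-alef ligature. This is only an illustrative sketch using Python's `unicodedata` module; normal text should store the two letters, not the compatibility character.

```python
import unicodedata

lam_alef_ligature = "\ufefb"  # compatibility character, for illustration only

ligature_name = unicodedata.name(lam_alef_ligature)
mapping = unicodedata.decomposition(lam_alef_ligature)
letters = unicodedata.normalize("NFKC", lam_alef_ligature)
letter_names = [unicodedata.name(ch) for ch in letters]

print(ligature_name, "|", mapping, "|", letter_names)
```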

The top line on this next slide shows another Arabic ligature. This ligature is optional and will only be displayed if the font developers included it. In other words, the number of ligatures available will generally vary with the font being used.

The second line shows that ligatures in Arabic also have joining forms when they occur alongside other characters.

This slide shows some ligatures used to render Indic consonant clusters.

Again, the number of ligatures available in a font varies. In some fonts the lower example may simply be rendered using a visible virama.

Ligatures are not only used for combining consonants. This slide shows the effect of combining a single vowel sign with various consonants in Tamil. As you can see, the combinations produce some complex and very varied results.

Joiner & non-joiner control characters

We have seen how Arabic glyphs join up with each other when juxtaposed. Unicode provides some special characters, invisible to the naked eye and to processing algorithms, to help control joining behaviour manually.

The zero-width non-joiner character (U+200C) can be inserted between the three characters LHM to create the effect on the second line. Here the three characters are not separated by spaces, but the glyphs no longer join.

The zero-width joiner character (U+200D), on the other hand, has the opposite effect. The three characters on the third line have spaces between them, but the joiner character is used to produce the joining forms of the glyphs. This behaviour is occasionally needed for correctly rendering Arabic text.

Unicode allows you to force a consonant + virama sequence to display the virama where the font would otherwise have used a half-form – add a zero width non-joiner immediately after the virama of the dead consonant.

Unicode allows you to force a dead consonant to assume a half-form rather than combine as part of a ligature – place a zero width joiner immediately after the virama.
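The two sequences just described can be built as follows. This is a minimal sketch: the helper names `explicit_virama` and `half_form` are made up for this example, the Devanagari virama is assumed, and whether the requested shape actually appears still depends on the font and rendering engine.

```python
import unicodedata

ZWNJ = "\u200c"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"     # ZERO WIDTH JOINER
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA (assumed script for the example)

def explicit_virama(first, second):
    """Ask for a visible virama on `first` instead of a half-form or conjunct."""
    return first + VIRAMA + ZWNJ + second

def half_form(first, second):
    """Ask for the half-form of `first` instead of a full conjunct ligature."""
    return first + VIRAMA + ZWJ + second

KA, SSA = "\u0915", "\u0937"
for text in (explicit_virama(KA, SSA), half_form(KA, SSA)):
    print(" ".join(f"U+{ord(ch):04X}" for ch in text))
```

Both control characters are invisible format characters, so each sequence is four code points long even though nothing extra is drawn.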

Summary

Text direction

Vertical text

Text flow

Vertical Chinese, Japanese and Korean text runs from top to bottom in lines that progress from the right to the left of the page. (All of these scripts can also be written horizontally.)

Vertically oriented text is still very common in printed matter such as books, magazines and newspapers.

Rotations & shifts

This and the two following slides show differences between the same text when set horizontally and vertically.

On this slide we see how parentheses and vowel lengthening marks are rotated. Bear in mind that this does not reflect any change in the underlying characters – the change is purely in the choice of glyph.

This slide shows how punctuation and small kana characters move from one corner of the cell square to the other. This is not a question of rotation.

This is not always an issue. In Chinese a period or a comma is typically centered in the character cell.

This slide shows the treatment of embedded Latin text. Text typically flows down the line at a 90 degree rotation from the East Asian characters. Acronyms and initials, however, are commonly not rotated.

Tate chu yoko

Vertical text often includes short runs of horizontal numbers or Latin text, called tate chu yoko. The example on this slide shows the heading of a newspaper article in Korean.

(Note also the use of a hanja character meaning ‘hundred’ between the 3 and the 76.)

Vertical columns

This slide illustrates the path of the eye in two-column vertical text. The columns, of course, run horizontally. If you are implementing an OCR application, this is an important thing to get right.

Bidirectional text

Right alignment

Arabic and Hebrew scripts run predominantly from right to left and are right-aligned by default.

Bidirectional ordering

Text in right-to-left scripts does not always flow right to left. Embedded Latin text and all numbers are read left to right. It is for this reason that these scripts are referred to as bidirectional.

The slide illustrates the direction of the eye while reading the line containing a date.

Note that there may be slight differences between the Arabic and Hebrew approaches here. The Arabic for the dates ‘10-12’ reads ‘12-10’ – only the numbers flow left to right. In Hebrew it is likely that you would see ‘10-12’ – the whole expression now runs left to right.

It is important to understand that the order of characters in memory is unidirectional, following what is called the logical order. The reordering you see on screen or paper is magic worked by the text rendering algorithms of the software.

The order in memory essentially follows the order in which the text is typed.

East Asian scripts often used to also be read from right to left, though this is much less common nowadays. The example shown on the slide is a Traditional Chinese newspaper where the text of the articles is predominantly vertical. The headings of the articles and the captions of the pictures run right to left. In this way the reading direction of the horizontal text is consistent with the flow of text in the surrounding articles.

Japanese text never does this any more. Even in newspapers containing vertically set text, the titles and captions always run left to right these days.

Unicode bidirectional algorithm

The Unicode Standard has a Bidirectional Algorithm that should be used to support the display of bidi text. Unlike many character sets, Unicode provides several properties of semantic information for every character. One of these properties indicates the behavior of the character with regard to inline directionality.

All Arabic and Hebrew letters have a directional type of right-to-left. Most other letters have a left-to-right type. Digits have number types of their own, but a sequence of digits is always displayed left to right. Punctuation is typically directionally neutral, since its location depends on the context.
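The directional type of a character can be looked up directly. The sketch below uses `unicodedata.bidirectional` from Python's standard library; the sample characters are chosen for illustration.

```python
import unicodedata

samples = {
    "Hebrew ALEF": "\u05d0",
    "Arabic ALEF": "\u0627",
    "Latin a": "a",
    "digit 1": "1",
    "Arabic-Indic digit 1": "\u0661",
    "full stop": ".",
    "exclamation mark": "!",
}

# R / AL = right-to-left letters, L = left-to-right,
# EN / AN = number types, CS = common separator, ON = other neutral.
bidi_types = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}

for label, bidi_type in bidi_types.items():
    print(f"{label}: {bidi_type}")
```

The digits and the full stop carry weaker types than the letters, so their final direction is resolved from the surrounding context.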

The next slide illustrates the use of this typing.

As each of the right-to-left characters is typed on the first and second lines, the bidirectional algorithm places the next character to the left of the previous one.

Because the numbers have a type of left-to-right, the 0 is automatically added to the right of the 1.

The punctuation relies on context to determine its position, so when it is initially entered the rendering algorithm assumes that it is a sentence-final period and part of the overall right-to-left flow. If a space and some more Hebrew characters were then input, the period would remain there.

If, however, the next input character is a number, the rendering algorithm views the period as part of the left-to-right flow of the number, and moves it automatically to the right of the 10.

The bidirectional algorithm is somewhat more complicated than this in reality, but this helps you understand the basic idea.
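The behaviour of the period can be approximated with a very small sketch. It covers only one idea, namely that a single separator sitting between two European digits is treated as part of the number, and it should not be mistaken for an implementation of the bidirectional algorithm. The function name is invented for this example.

```python
import unicodedata

def resolve_number_separators(text):
    """Return bidi types, treating a lone separator between digits as a number."""
    types = [unicodedata.bidirectional(ch) for ch in text]
    for i in range(1, len(types) - 1):
        # CS = common separator (e.g. the period), EN = European number.
        if types[i] == "CS" and types[i - 1] == "EN" and types[i + 1] == "EN":
            types[i] = "EN"
    return types

print(resolve_number_separators("10."))   # period still waiting for context
print(resolve_number_separators("10.5"))  # period absorbed into the number
```

Until a digit follows it, the period keeps its own type and takes its direction from the surrounding text; once a digit is typed after it, it is handled as part of the left-to-right number.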

Mirrored characters

The treatment of paired characters such as parentheses deserves a brief mention here. The text on the slide flows consistently from right to left. The point of interest is the shape of the parentheses.

The first parenthesis encountered while reading – looks like ')' – is actually a LEFT PARENTHESIS, ie. in left-to-right text it looks like '('. The bidirectional algorithm expects the visual shape of such mirrored characters to be swapped when used in a right-to-left context. (In Unicode 1.0 the LEFT PARENTHESIS was actually referred to as OPENING PARENTHESIS.)
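Whether a character is subject to this swapping is recorded as a character property. The sketch below checks it with `unicodedata.mirrored`; the `MIRROR_GLYPH` table and the `glyph_for` helper are hypothetical and deliberately partial, since a real renderer would use the Unicode mirroring data rather than a hand-written dictionary.

```python
import unicodedata

# Hypothetical, partial glyph-swap table for illustration.
MIRROR_GLYPH = {"(": ")", ")": "(", "[": "]", "]": "[", "<": ">", ">": "<"}

def glyph_for(ch, rtl_context):
    """Pick the shape to draw for `ch`; the stored character never changes."""
    if rtl_context and unicodedata.mirrored(ch):
        return MIRROR_GLYPH.get(ch, ch)
    return ch

print(unicodedata.name("("), unicodedata.mirrored("("))
print(glyph_for("(", rtl_context=False), glyph_for("(", rtl_context=True))
```

The character in memory is LEFT PARENTHESIS in both cases; only the glyph chosen for display differs.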

This approach facilitates the matching of character sequences across scripts.

Bidi formatting control characters

There are occasionally situations where the bidirectional algorithm needs a little help to determine the directional context.

Unicode provides special, invisible control codes to help clarify such ambiguities or intentional deviations from the rules of the bidirectional algorithm.

The sample sentence preceded by the asterisk on the slide shows what you get if you rely solely on the bidirectional algorithm, due to the inherent ambiguity of the phrase.

By applying an embedding control character as shown in the view of the logical order of characters at the bottom of the slide, the correct result can be obtained (the middle line).

NOTE: Although these codes should do the job, in marked up text such as HTML and XML you should not use these control codes but use available markup instead (eg. the dir attribute in HTML).
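For plain text, where no markup is available, an embedding can be sketched as below. The helper names `embed_rtl` and `embed_ltr` are invented for this example; the control characters themselves are the embedding codes defined by Unicode.

```python
import unicodedata

LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING (closes the embedding)

def embed_rtl(text):
    """Wrap `text` so it is treated as a right-to-left run."""
    return RLE + text + PDF

def embed_ltr(text):
    """Wrap `text` so it is treated as a left-to-right run."""
    return LRE + text + PDF

wrapped = embed_rtl("abc")
print([f"U+{ord(ch):04X}" for ch in wrapped])
```

Each opening control must be matched by a closing PDF, which is one reason markup is the safer choice wherever it is available.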

Visual selection

The difference between the logical, underlying order of bidirectional text and the displayed order of characters also has an impact on highlighting.

The next slide illustrates what may happen if you place your cursor at the point labeled “Start here” and extend your selection to the point marked “End here”.

As you can see here, two separate ranges of text have been highlighted, one of which falls outside the two points we mentioned on the previous slide. A look at the underlying codes (see the bottom of the slide) immediately reveals the inherent logic in this approach. Although this operation produced two visual selections, it produced only a single logical selection in memory.

It is also possible to find applications that would have produced a single visual highlight in this case. It is important to note, however, that this would represent two independent logical selections in memory.

Directional bias in layout & graphics

Screen layout

A predominant reading direction of right-to-left can have an impact on more than just the text. If you look at the Arabic and Hebrew sample pages in Internet Explorer you will see that the scroll bar appears on the left.

In Arabic and Hebrew environments the layout of screen information is typically mirrored to reflect the scanning direction of the text.

The screen shot of an editor on the slide shows the following differences from the English version:

In addition, the text on pull down menus is on the right and accelerator keys are listed to the left.

If there were submenus they would cascade to the left, not the right as in an English user interface.

All the items on this dialogue box are mirrored by comparison to the English version.

Graphics, icons and charts

In addition to layout of user interfaces, directionality may affect the layout of charts, tables, spreadsheets, collated pictures, and the like. (This appears to be more consistently the case for Arabic documents than for Hebrew.)

(It is worth visiting the Arabic or Hebrew sample page on the Web to see the effect of changing the directionality of the page on the table cells shown at the top of this slide. Just click on the button provided.)

Also, any graphics showing directional bias will need to be mirrored.

This slide shows a small selection of icons that exhibit directional bias and will probably need to be replaced with mirrored versions in an Arabic or Hebrew context.

Because Arabic and Hebrew documents run right to left, the turnover should appear to the left. The icons in the middle include directionality that is based on the assumed direction of text flow. The top right icon shows cascading windows, but on many Arabic or Hebrew platforms the windows cascade to the left. And the bottom right icon portrays a table, which as we saw on the previous slide would most likely be mirrored.

Note that the process of producing mirrored versions of these icons is fairly straightforward – just flip the graphic. This becomes more difficult if non-symmetrical letters have been used. (Although of course on another level, one could question the appropriateness of a Latin letter on an Arabic or Hebrew user interface.) In addition, the functionality associated with the undo and redo icons may require a relocation of icons, rather than a simple mirroring of the graphics.

Summary

Linguistic boundaries, line breaking & justification

Word boundaries

Western

It is not easy to determine what is meant by ‘word’. Typically people initially think of items in a sentence separated by spaces or certain types of punctuation. In languages such as German and Turkish, however, such runs of text can include a number of concepts run together.

This and the next few slides will consider in a very basic way the relevance of ‘words’ to some of the scripts in discussion here. For want of better terminology, we will use the term word in a general sense to mean a unit of meaning smaller than a phrase or sentence. We will also consider the highlighting behavior of Windows when you double-click in the middle of some text.

The example on this slide is Greek. Greek words are delimited by spaces. Typically double-clicking in Windows will highlight the text between spaces (and, depending on your settings, some space too).

Chinese

Chinese does not use spaces for word separation. Most ideographs have word-like meanings, although it is common for a sequence of characters to have a composite meaning derived from the individual parts.

Windows uses a dictionary lookup approach for double-click selection. The example on this slide was produced by double-clicking one of the two characters highlighted.

Japanese

Japanese also makes no use of spaces for word separation. The apparent spacing in the example above is simply the lack of ink in the mono-spaced character cells.

The examples on this slide show the effect of double-clicking in Windows in a number of different contexts. The first two show how Windows uses a dictionary-based approach to locate word boundaries within a run of kanji and hiragana text respectively. The third example is katakana text. The fourth example (at the bottom) highlights both the kanji and the hiragana that constitute an inflected word.

Korean

Korean does separate words with spaces.

Double-clicking works in the same way as the Greek example.

Thai

Thai uses spaces, but to separate phrases or sentences, not words. At the same time there is a fairly clear notion of where word boundaries fall.

Double-clicking on the text highlights one word at a time. Windows uses a dictionary-based approach to achieve this. Other applications may require the user to type in zero-width spaces after every word to make word detection and line breaking work.
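Where zero-width spaces have been typed in, finding the words becomes a simple split. This is a minimal sketch of that fallback approach only; the helper name `split_words` and the two-word sample are assumptions for illustration, and text without the markers would still need a dictionary.

```python
import unicodedata

ZWSP = "\u200b"  # ZERO WIDTH SPACE, invisible but marks a word boundary

def split_words(text):
    """Split Thai text at explicitly typed zero-width spaces."""
    return [word for word in text.split(ZWSP) if word]

# Two Thai words joined by an invisible boundary marker.
sample = "ภาษา" + ZWSP + "ไทย"
words = split_words(sample)

print(len(sample), words)
```

Python's default `str.split()` would not help here, because the zero-width space is a format character rather than whitespace, so it has to be named explicitly.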

Line breaking

Basic alternatives

In this section we will look at line breaking. Justification often occurs at the same time, but we will examine it separately to keep the explanations simple.

Line breaking is typically word-based or character-based. Character-based line breaking usually involves the application of special character-specific rules.

If you have Internet Explorer version 5 or above you can see how each script wraps by going to the word wrap tester page and changing the width of the browser window. It is impressive to see how, if all scripts are displayed together, each line wraps according to its own rules.

English, Greek, Hindi, and Russian text wraps whole words onto the next line.

Arabic and Hebrew do the same, but the text wraps to the right. Wrapping of embedded Latin text produces a special effect that will be described later.

Chinese, Japanese and Korean all wrap on a character by character basis, subject to the rules that will be described later. Korean is sometimes wrapped on a word basis, but it is more common these days to wrap on a character basis, despite the fact that Korean words are separated by spaces.

Thai is wrapped on a word basis, but a dictionary or other mechanism is needed to detect word boundaries, since they are not separated by spaces.

CJK line breaking rules

This slide shows the rules for character-based line breaking that apply by default for Japanese in Office XP, minus the full vs. half width duplicates.

Similar rules apply to Chinese and Korean line breaking.

The question arises: if Japanese and Chinese are typically grid-like in layout, what happens when a character such as a comma would by default appear at the beginning of a line, as in the first example above?

Typically there are two possible approaches.

  1. the preceding character is pulled down to the next line

  2. the comma is left protruding into the margin.

These alternatives are illustrated in the lower level panels on the slide.

In fact there is another alternative if justification is available, but we will leave that for the next section.
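
The two approaches can be sketched as follows for text laid out on a fixed grid. The function name `wrap_grid`, its `hang` parameter and the small set of characters in `NO_LINE_START` are assumptions for this example; they are not the Office XP rules mentioned earlier, and the sketch assumes a width of at least two cells.

```python
# Illustrative subset of characters that must not begin a line.
NO_LINE_START = set("、。，．）」』")


def wrap_grid(text, width, hang=False):
    """Wrap text character by character, one character per grid cell.

    hang=False: the preceding character is pulled down to the next line.
    hang=True:  the punctuation is left protruding into the margin.
    """
    lines = []
    line = ""
    for ch in text:
        if len(line) < width:
            line += ch
        elif ch not in NO_LINE_START:
            lines.append(line)          # ordinary break
            line = ch
        elif hang:
            lines.append(line + ch)     # line is one cell wider than the grid
            line = ""
        else:
            lines.append(line[:-1])     # previous character moves down
            line = line[-1] + ch
    if line:
        lines.append(line)
    return lines


pulled_down = wrap_grid("あいうえお、かきく", 5)
hanging = wrap_grid("あいうえお、かきく", 5, hang=True)
```

With the first approach the first line ends one cell short of the margin, which is the blank space that justification can later remove.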

Wrapping Latin text in Arabic & Hebrew

This slide shows the result of breaking a line in the middle of some Latin text in Arabic and Hebrew. The result is not immediately obvious for people unaccustomed to these scripts, as the order of words appears to be swapped.

This is because, although you can read in either direction horizontally, you are only expected to read down from one line to the next.

It is important to note that the order of characters in memory has NOT changed. This is purely rendering magic.
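
The effect can be imitated with a very simplified sketch. It works on whole words rather than characters, and uses UPPERCASE words as stand-ins for Arabic or Hebrew words and lowercase words for Latin text. The function names are assumptions for this example, and the real Unicode bidirectional algorithm is considerably more involved. The key point is that the line is broken in memory order first, and only then is each line reordered for display.

```python
def is_rtl(word):
    # Stand-in convention: UPPERCASE words play the role of Arabic/Hebrew words.
    return word.isupper()


def visual_order(logical_words):
    """Return the left-to-right display order of one line of a right-to-left paragraph."""
    # Step 1: the line as a whole runs right to left, so reverse it.
    reversed_line = logical_words[::-1]
    # Step 2: runs of Latin words still read left to right, so flip each run back.
    out, run = [], []
    for word in reversed_line:
        if is_rtl(word):
            out.extend(run[::-1])
            run = []
            out.append(word)
        else:
            run.append(word)
    out.extend(run[::-1])
    return out


memory = ["ONE", "TWO", "alpha", "beta", "gamma", "THREE"]  # order in memory

unbroken = visual_order(memory)   # everything fits on one line
line1 = visual_order(memory[:3])  # line broken after "alpha"
line2 = visual_order(memory[3:])
```

On a single line the Latin words appear as alpha beta gamma. After the break, alpha sits at the left of the first line while beta gamma sits at the right of the second, which is why the words seem to have swapped places even though `memory` was never modified.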

Hyphenation

Latin and Cyrillic scripts allow hyphenation of words at the end of a line in order to achieve a better fit.

It is important to note that hyphenation rules differ from language to language within the same script. The slide shows hyphenation that is not permitted according to German orthographic rules.

Justification

Basic alternatives

This slide lists possible approaches to justification. These include:

In practice, justification will commonly involve adjustment of both word and glyph spacing at the same time.

This slide shows an unjustified text.

On this slide, justification has used inter-word spacing only. Note how the result is less than perfect, with large inter-word spaces on the second line, and no justification to the single word on the third line.
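
A minimal sketch of justification by inter-word spacing alone, assuming a mono-spaced font so that widths can be counted in characters. The function name `justify` is an assumption for this example. Note how a line containing a single word has no gaps to stretch and is therefore left as it is.

```python
def justify(words, width):
    """Pad the gaps between words so that the line is exactly `width` characters."""
    gaps = len(words) - 1
    spaces = width - sum(len(word) for word in words)
    if gaps < 1 or spaces < gaps:
        # A single word, or a line that is already too long: nothing to stretch.
        return " ".join(words)
    base, extra = divmod(spaces, gaps)
    parts = []
    for i, word in enumerate(words[:-1]):
        # The leftmost gaps absorb any spaces that cannot be shared out evenly.
        parts.append(word + " " * (base + (1 if i < extra else 0)))
    parts.append(words[-1])
    return "".join(parts)


even = justify(["the", "quick", "fox"], 15)
uneven = justify(["a", "bb", "c"], 7)
single = justify(["word"], 10)
```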

In this third slide, both inter-word and inter-character spacing have been applied to the same text, and produce a much better result.

Note that justification does not only involve expansion. In fact it is common for a justification algorithm to attempt to reduce inter-word or inter-character spacing first, up to a certain limit, before expanding them.

Note also that expanding inter-character spaces in German will indicate to a German reader that the words are emphasized, not justified. So stretching inter-character spaces is uncommon in German text.

Justification in Chinese & Japanese

This slide illustrates how justification can be used to remove the blank space at the end of the first line of text that we saw in the section about line breaking. The justification involves equally expanding the space between all characters on the first line.

Typically in character-based justification, rules are applied to different types of character in successive waves. For example, the algorithm may attempt to reduce the spacing around punctuation first, and only when more adjustment is needed turn to adjusting the spacing between ideographs.
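
The idea of successive waves can be sketched as a simple budget calculation. All the names and limits here are assumptions for illustration only: amounts are expressed in thousandths of an em, and the default limits of 500 per punctuation mark and 50 per gap between ideographs are invented values, not taken from any particular product.

```python
def compression_plan(overflow, punct_count, gap_count,
                     max_per_punct=500, max_per_gap=50):
    """Decide how much to squeeze a line that is `overflow` units too long.

    Returns (taken_from_punctuation, taken_from_ideograph_gaps, unresolved).
    """
    # Wave 1: reduce the spacing around punctuation, up to its limit.
    from_punct = min(overflow, punct_count * max_per_punct)
    remaining = overflow - from_punct
    # Wave 2: only if more is needed, reduce the gaps between ideographs.
    from_gaps = min(remaining, gap_count * max_per_gap)
    return from_punct, from_gaps, remaining - from_gaps


small = compression_plan(800, punct_count=2, gap_count=20)
medium = compression_plan(1500, punct_count=2, gap_count=20)
large = compression_plan(2500, punct_count=2, gap_count=20)
```

A non-zero third value means that the line cannot be made to fit by compression alone, and some other strategy is needed.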

In the section on line breaking we saw how punctuation can be left protruding into the right margin. Justification can also be used to draw this punctuation into the main body of text by reducing the inter-character spacing across the line.

Justification in Arabic

This slide illustrates justification in Arabic based on extension of the baseline.

More sophisticated rendering algorithms produce this effect without adding additional characters to memory. A less sophisticated approach may involve adding baseline extension characters called tatweel or kashida (U+0640) to the text.

Note too that this kind of baseline extension is also used for emphasising text in Arabic, for example in headings.

Summary

Other typographic & implementation issues

Character size & line height

Glyph complexity

In this and following slides I look at the minimum number of pixels required on something like an LCD panel to achieve a quality rendering of characters in a number of scripts. This has implications for line height and pixel resolution on screen. It also tends to impact the use of bolding and italicization, since they require additional pixels for rendering.

English generally fits adequately in a 6x8 pixel block.

Unfortunately this is not true for other languages based on the Latin script. Accents over or under upper case letters in particular tend to demand additional pixels.

Japanese typically uses a 16x16 pixel square that includes a gutter of 1 pixel horizontally and vertically. I have seen 14x14 pixel implementations, however, that are deemed acceptable. Such arrangements require the omission of some strokes for the more complicated characters, but Japanese people are still able to understand what character is intended.

Most Thai characters require only around 7 pixels in width, although there are a small number that may require around twice that.

In height, however, Thai demands a minimum of 22 pixels (plus more inter-line spacing than is usual for Latin text).

In Chinese (especially Traditional) there are many hundreds of characters that cannot be rendered in a 16x16 pixel grid. An adequate size is likely to involve around 24x24 pixels. (Count the lines and spaces on the example on this slide and you will see that a minimal representation of this character requires more than that.)

Line height & inter-line spacing

Even after supplying sufficient numbers of pixels to accommodate the complex shapes we have seen, many of these scripts demand additional inter-line spacing.

The example on the left of this slide shows what 16x16 pixel characters would look like without additional inter-line spacing. There are two issues here:

  1. the characters appear to run into each other and are difficult to read (especially if underlining is applied)

  2. it is not immediately apparent whether the text should be read vertically or horizontally.

Additional spacing as shown on the right alleviates both of these problems.

Baseline alignment

Another issue to be borne in mind concerns baseline alignment. The slide shows a number of possible types of baseline.

If using a font that includes more than one script this is not usually an issue since the font designer will normally ensure an appropriate match between baselines and character sizes for glyphs from different scripts. If, however, you are mixing scripts using different fonts, it is important to ensure that alignment is appropriate.

Proportional spacing

Whereas East Asian characters tend to use mono-spaced glyphs as the default, a script such as Arabic is extremely difficult to fit into a mono-spaced font. Arabic really demands proportionally-spaced glyphs.

In addition, scripts that use combining characters require the ability to overlap characters. This may cause significant problems for LCD panels.

Ruby

Furigana

The term ruby is used to refer to annotations typically occurring in East Asian scripts. In Japanese this is called furigana.

Furigana is typically used to provide phonetic transcriptions (in hiragana) of obscure characters, or characters that the reader is not expected to be familiar with. For example it is widely used in education materials and children’s texts.

Phonetic transcription normally appears above horizontal text. Sometimes semantic information is provided below the horizontal text.

In vertical text, above equates to right, and below equates to left.

Bopomofo

Such annotation in Traditional Chinese uses bopomofo to indicate the pronunciation, and rather than appearing above the main text, the annotation is included vertically to the right of each character, whether the main text is vertical or horizontal.

Interlinear annotation characters

Unicode provides special control characters that can be used to indicate what is ruby in plain text, as shown on the slide.

NOTE: these characters should not be used in a markup language such as HTML if a markup-based alternative is provided.
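
For plain text, the three characters in question are U+FFF9 (interlinear annotation anchor), U+FFFA (separator) and U+FFFB (terminator). The sketch below wraps a base text and its annotation in these characters, and also removes them again for a process that only wants the base text. The function names and the sample word are assumptions for this example, and nested annotations are not handled.

```python
import re

ANCHOR = "\ufff9"      # INTERLINEAR ANNOTATION ANCHOR
SEPARATOR = "\ufffa"   # INTERLINEAR ANNOTATION SEPARATOR
TERMINATOR = "\ufffb"  # INTERLINEAR ANNOTATION TERMINATOR


def annotate(base, ruby):
    """Mark `ruby` as the annotation of `base` in plain text."""
    return f"{ANCHOR}{base}{SEPARATOR}{ruby}{TERMINATOR}"


# anchor, base text (captured), separator, annotation text, terminator
_ANNOTATED = re.compile(
    f"{ANCHOR}([^{SEPARATOR}{TERMINATOR}]*){SEPARATOR}[^{TERMINATOR}]*{TERMINATOR}"
)


def strip_annotations(text):
    """Keep only the base text, dropping the annotations and the controls."""
    return _ANNOTATED.sub(r"\1", text)


marked = annotate("漢字", "かんじ")
plain = strip_annotations("A" + marked + "B")
```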

Miscellaneous

Emphasis

This section gathers together a small number of additional typographic features that may differ across scripts.

This slide illustrates alternative, native Japanese, methods of emphasizing text. In the top example, small dots (called wakiten) are placed above the characters to be emphasized – one dot per character. (In vertical text they appear to the right of the character.)

The second example shows emphasis being indicated by the use of a light shaded box behind the relevant characters. This is called amikake.

Note also, as was mentioned earlier, that emphasis can be achieved in German by widening the spaces between characters.

Emphasis in Cyrillic is commonly achieved by italicization, as in Latin text. However, italicization of Cyrillic typically changes certain glyphs in a systematic way. It cannot be achieved simply by distorting the non-italicized text slightly. Firstly, many characters adopt a more rounded shape.

Other Cyrillic letters adopt a very different base shape during italicization.

Kumimoji and warichu

As an example of other typographic effects that may need to be supported, Japanese typography frequently uses approaches such as kumimoji and warichu.

Kumimoji (top line on the slide) refers to composites consisting of up to 5 characters that are reduced in size and combined to fit within the space of a single character. Such arrangements can be created as needed by the user if there is a capability to display the text correctly.

Warichu (bottom line on the slide) is a run of text of reduced font size that appears inside of a line of text as two lines of equal height and length.

(These examples and definitions are taken from the CSS3 Text Module.)

Summary

Sorting & case conversion

Sorting

Basic Latin

Unicode-based text cannot be sorted by simply comparing code point values, any more than English text in ASCII can, given that ASCII separates upper and lower case letters into different ranges.
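
A short demonstration of the problem, using Python's built-in `sorted`, which compares strings by code point. The word list is an assumption made up for this example.

```python
# "\u00e9clair" is "éclair" written with a precomposed e-acute (U+00E9).
words = ["apple", "Zebra", "\u00e9clair", "zoo"]

by_code_point = sorted(words)

# The code points that decide the outcome:
first_letters = {word[0]: ord(word[0]) for word in words}
```

Because upper case Z (90) precedes lower case a (97), and é (233) follows z (122), the result is Zebra, apple, zoo, éclair, which is not an order that a reader of English or French would expect.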

A typical approach to sorting text will make several passes on the data – commonly three or four. This and following slides introduce the basic concept at a high level for Latin text, as a base for discussion of non-Latin text later.

Prior to the initial collation pass, the data may need to be pre-processed, for example to resolve the underlying order of abbreviations.

In the initial pass accents are typically ignored in favor of a sort based on the primary letters of the alphabet. In certain cases, however, accents may be an integral part of the primary letters or other criteria may become important. The slide illustrates some examples.

In Swedish, accents create letters that are seen as independent letters of the alphabet. These letters appear after z in the sorted order.

In Czech the sequence of two letters ‘ch’ is treated as a single, primary unit for sorting. Words starting with ch sort after h, not after c.

In German the a-umlaut may be sorted as if it represented two primary letters ae, with no umlaut at all.

Also German provides an example of a primary sort letter that has no equivalent in upper-case. The ß and ss are treated as equivalent for sorting.

Even within a single script these common differences cause the appropriate alphabetic order of letters to vary from language to language.
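
As an illustration of one such difference, the sketch below treats the Czech sequence 'ch' as a single sorting unit that follows h. The names `CZECH_ORDER` and `czech_key` are assumptions for this example, and the alphabet is deliberately simplified: Czech letters with diacritics are left out, and any unlisted character simply sorts last.

```python
# Basic Latin letters with the digraph "ch" inserted after "h".
CZECH_ORDER = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {unit: position for position, unit in enumerate(CZECH_ORDER)}


def czech_key(word):
    """Turn a word into a list of ranks, reading 'ch' as one unit."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        if word.startswith("ch", i):
            key.append(RANK["ch"])
            i += 2
        else:
            key.append(RANK.get(word[i], len(RANK)))  # unknown letters last
            i += 1
    return key


words = ["chata", "cihla", "hrad", "idea"]
letter_by_letter = sorted(words)
czech_style = sorted(words, key=czech_key)
```

Letter by letter, 'chata' comes first because c precedes h. With 'ch' as a unit it moves to a position after 'hrad'.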

In the second pass accents are typically addressed for entries that would otherwise be the same.

The preferred order of diacritics may vary from language to language, and to some extent from application to application.

The normal way to address ordering is to apply the rules for accent order from the front of the word towards the back. French, however, does the opposite, resulting in the different sequence illustrated on the slide.
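
The difference can be sketched by building an accent pattern for each word and comparing it either from the front or from the back. The function names and the choice of four French words are assumptions for this example; the words differ only in their accents, so the primary pass cannot separate them.

```python
import unicodedata


def split_levels(word):
    """Return (base letters, tuple of accents attached to each base letter)."""
    bases, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and bases:
            accents[-1] += ch       # attach the mark to the preceding letter
        else:
            bases.append(ch)
            accents.append("")
    return "".join(bases), tuple(accents)


def forward_key(word):
    primary, accents = split_levels(word)
    return primary, accents          # accents compared from the front


def french_key(word):
    primary, accents = split_levels(word)
    return primary, accents[::-1]    # accents compared from the back


# cote, côte, coté, côté (written with escapes for the accented letters)
words = ["c\u00f4t\u00e9", "cote", "cot\u00e9", "c\u00f4te"]
front_to_back = sorted(words, key=forward_key)
back_to_front = sorted(words, key=french_key)
```

Working from the front gives cote, coté, côte, côté. Working from the back gives cote, côte, coté, côté.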

When accents have been dealt with, any differences in case may be addressed in a third pass. Whether upper-cased letters are sorted first or second varies from application to application and platform to platform.
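
The three passes can be approximated with a single compound sort key, since Python compares tuples element by element and only consults a later element when the earlier ones are equal. This is a minimal sketch with assumed names, not the Unicode Collation Algorithm, and it arbitrarily places lower case before upper case in the third level.

```python
import unicodedata


def three_level_key(word):
    bases, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and bases:
            accents[-1] += ch
        else:
            bases.append(ch)
            accents.append("")
    primary = "".join(bases).lower()                # letters only
    secondary = tuple(accents)                      # accents, front to back
    tertiary = tuple(ch.isupper() for ch in bases)  # False (lower) sorts first
    return primary, secondary, tertiary


# résumé, Resume, rose, resume
words = ["r\u00e9sum\u00e9", "Resume", "rose", "resume"]
ordered = sorted(words, key=three_level_key)
```

The accented word only moves behind its unaccented twins because all three share the same primary key, and 'rose' stays last regardless of accents or case.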

Arabic

A language written with Arabic script will also need to initially ignore diacritics such as the short vowels so that the two versions of the word above sort in the same place.

Note that, whereas most of the examples on the previous slides typically involve precomposed characters, here we are dealing with one string that has 5 characters and another that has 9 characters.
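
A sketch of how the diacritics might be set aside for the initial pass. The sample word is an assumption chosen for this example (it is not necessarily the word on the slide): written without vowel marks it has 5 characters, and with them it has 9. The marks are identified by their non-zero combining class in the Unicode database.

```python
import unicodedata

# meem, kaf, teh, beh, teh marbuta
unvowelled = "\u0645\u0643\u062a\u0628\u0629"
# The same letters with fatha (U+064E) and sukun (U+0652) added.
vowelled = "\u0645\u064e\u0643\u0652\u062a\u064e\u0628\u064e\u0629"


def primary_key(text):
    """Drop combining marks so that both spellings share one primary key."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


same_place = primary_key(vowelled) == primary_key(unvowelled)
lengths = (len(unvowelled), len(vowelled))
```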

Whereas for most languages using the Arabic script sorting follows the kinds of alphabetic principles already discussed, sorting of words for the Arabic language often uses a much more complicated approach.

If you look for the word pronounced ‘maktaba’ in a dictionary you are not likely to find it in the normal phonetic ordering of entries. This is because Arabic sorting makes use of the fact that Arabic words are commonly based on letter patterns or roots (usually of three letters), and groups words that are based on the same pattern together. In this case the underlying pattern we need to look for is k‑t‑b.

When we look up ‘kataba’ (to write) we will find words such as those shown on this slide as sub-entries, including the word we were looking for, ‘maktaba’.

This slide shows the meaning rather than the pronunciation of the words on the previous slide. Note how they are all related in meaning because they share the same root.

Thai

A sort key for Thai begins with the Thai consonants, then continues with the vowel signs. Thus all consonants without vowel signs come before those with vowel signs. In addition, we have seen how there is often more than one vowel sign associated with a base consonant in Thai. Each of these groups of vowel signs sorts in a unique position.

The slide illustrates an additional phenomenon. Vowel signs that are displayed before the base consonant are sorted as if they appeared after it, i.e. following the pronunciation. Because these vowel signs are not combining characters but just ordinary spacing characters in Thai, this means that some pre-processing is typically required before sorting can begin. It also means that words beginning with a vowel sign appearing to the left of its base consonant will appear in many different places in a list of words, as shown above.
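
The pre-processing step might look something like the sketch below, which swaps each leading vowel sign (U+0E40 to U+0E44) with the character that follows it before comparison. The function name and the three short sample words are assumptions for this example, and comparing the rearranged strings by code point is only a stand-in for a proper Thai collation.

```python
LEADING_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")  # เ แ โ ใ ไ


def thai_sort_key(word):
    """Move each leading vowel sign to follow its base consonant."""
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in LEADING_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


# เก (vowel sign first in memory), ขา, กา
words = ["\u0e40\u0e01", "\u0e02\u0e32", "\u0e01\u0e32"]
as_stored = sorted(words)
reordered = sorted(words, key=thai_sort_key)
```

Sorted as stored, the word beginning with the vowel sign lands at the end of the list. After reordering, it sits among the other words that share its base consonant.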

Korean

Korean is very simple to sort in Unicode if you are only dealing with hangul characters, since they are all already in sorted order in the character set.

Of course embedded hanja, Latin text, punctuation and the like introduce additional complexity.
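
For the simple case, a plain code point comparison is enough, as this small sketch shows. The three words are assumptions for this example, and the claim only holds for precomposed hangul syllables (U+AC00 to U+D7A3), not for text that mixes in other kinds of character.

```python
words = ["하늘", "가방", "나무"]

in_order = sorted(words)  # plain code point comparison

first_syllables = [hex(ord(word[0])) for word in in_order]
```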

Chinese & Japanese

Chinese and Japanese can be sorted in a number of ways that use a radically different approach. These different sort methods may be application dependent. For example, a bilingual dictionary, an on-line help index, a telephone book, and a kanji lookup table may all sort in different ways.

The slide shows possible orderings for a number of Japanese words.

A typical sort order for a bilingual dictionary would be based on the underlying sound. The order would be based on syllabic sounds, not characters, the same as the traditional ordering of kana syllables (a, i, u, e, o, ka, ki, ku, ke, ko, … ). Because most kanji characters have at least two possible pronunciations, this means that words beginning with a character such as that seen in ‘mottomo’ and ‘saigo’ on the slide sort in different places in the list.

A JIS sort just arranges the characters in the order they appear in the JIS coded character set. Common characters in the JIS character set are arranged according to one (usually Chinese-derived) possible pronunciation for that character. You just have to know which sound it is. Less common characters are sorted further down the list by radical (see the next slide). This means that you need to know how common the character is to know where to look for it in the list. All words starting with the same character do, however, sort together.

Radical and stroke ordered lists ignore pronunciation altogether and sort characters on the basis of their shape.

In the former case, the first sort pass uses the radical of each character to sort text according to the traditional position of that radical in the historical list of 214 radicals. (The 214 radicals are grouped by number of strokes, but for radicals with the same number of strokes the order just has to be known. Note that the radical for the character representing the word ‘umi’ is classed as a four-stroke radical, although when the radical appears in the variant form used here the radical is drawn using only three strokes.) Characters that share the same radical are sorted in a second pass on the basis of the total number of remaining strokes in the character. If the radical and the number of remaining strokes are the same for more than one character, you just need to scan the list (entries that start with the same character(s) are grouped together).

Stroke-based sorting starts by ordering according to the total number of strokes in the character. Where there is a tie, the rules of the radical stroke sort are applied.
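
Both orderings can be sketched with a small lookup table. The table below is hand-entered for six simple characters and is an assumption for this example; a real implementation would draw on a complete data source rather than a table like this.

```python
# character: (radical number, strokes besides the radical, total strokes)
SHAPE_DATA = {
    "人": (9, 0, 2),
    "日": (72, 0, 4),
    "明": (72, 4, 8),
    "木": (75, 0, 4),
    "林": (75, 4, 8),
    "森": (75, 8, 12),
}


def radical_stroke_key(ch):
    radical, remaining, _total = SHAPE_DATA[ch]
    return radical, remaining


def total_stroke_key(ch):
    radical, remaining, total = SHAPE_DATA[ch]
    # Ties on total strokes fall back to the radical-stroke rules.
    return total, radical, remaining


chars = ["森", "明", "人", "林", "木", "日"]
by_radical = sorted(chars, key=radical_stroke_key)
by_strokes = sorted(chars, key=total_stroke_key)
```

The two lists differ only in the position of 明 and 木: by radical, 明 stays with 日 ahead of 木, whereas by total strokes the four-stroke 木 comes before the eight-stroke 明.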

Multilingual text

This slide introduces an additional issue when sorting. If you have text in a number of different scripts and/or languages, how do you order them?

The Unicode Consortium has developed the Unicode Collation Algorithm (UCA) to provide a default ordering for multilingual text that is consistent and predictable. This algorithm addresses all the characters in the Unicode repertoire. Note, however, that this is only a default ordering. Language-specific tailoring should be applied to the UCA to help users find text.

Often the most appropriate order may depend on who is reading the list. If a Kazakh is reading the list they may expect to see the Cyrillic entries at the top of the list, not the bottom as shown here. Then one needs to decide how to accommodate any differences between the Kazakh preference for sorting and those for the Russian shown on the slide – both use variants of the Cyrillic alphabet, but you can only sort in the way required by one language at a time.

To achieve true usability, you typically have to know exactly who the user is, and what their expectation would therefore be.

Indexing & alphabetic ordering

The implications of different sort orders mean that a presentation such as that shown on the slide needs significant work during localization. The alphabet at the top and the links need to be changed and the order of all elements in the index needs to be changed.

It is important to ensure that this can be accomplished as easily as possible.

Also, consider the potential cost implications of sorting, say, entries in a reference book alphabetically. The order of the material in the book will need to change during translation.

Content developers should wherever possible use automation to create indices and sorted lists, but those automated tools, of course, need to be able to cope with the necessary range of language- and script-specific tailorings.

Case conversion

A distinction between upper and lower case applies to Latin, Cyrillic, Greek and Armenian scripts. (Georgian makes a distinction between two variants of a character, which has been compared to a case distinction, but this is not used in modern Georgian.)

Like sorting, case conversion in Unicode cannot be achieved by simply adding or subtracting an offset to a code value. In different Unicode blocks the arrangement of upper and lower case variants is different. Also, mappings are not always straightforward and repeatable, as shown in the Turkish example on the top line of the slide.

Case conversion, like sorting, is also subject to different rules according to the language or dialect in question. The second line alludes to rules for accentuation of upper case letters that differ between European French and Canadian French. In Greek, syntactic differences affect the choice.

The third line shows mappings that are not one to one in German and French.

The fourth line shows an alternate mapping based on the distinction between lower case, upper case and title case in Serbian.

The Unicode database provides semantic information to assist in converting characters between upper and lower case.
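
Some of these points can be observed with Python's string methods, which draw on the case mappings in the Unicode database but do not take language into account. The helper `turkish_upper` is an assumption written for this example, not a library function, and it covers only the dotted and dotless i.

```python
# The distance between lower and upper case differs from block to block.
offsets = {
    "a / A": ord("a") - ord("A"),
    "U+0101 / U+0100": ord("\u0101") - ord("\u0100"),  # a with macron
    "U+0450 / U+0400": ord("\u0450") - ord("\u0400"),  # Cyrillic ie with grave
}

# Mappings are not always one to one: German sharp s becomes two letters.
sharp_s_upper = "\u00df".upper()

# The digraph U+01C6 has separate upper case and title case forms.
dz_upper = "\u01c6".upper()
dz_title = "\u01c6".title()

# The default mapping pairs i with I, which is not what Turkish expects.
default_upper = "istanbul".upper()


def turkish_upper(text):
    """Upper-case text with Turkish pairing: i/U+0130 and U+0131/I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


tailored_upper = turkish_upper("istanbul")
```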

Summary

Key sources

The top two sources provide very accessible information if you wish to delve deeper into most of the topics covered in this tutorial.

Author: Richard Ishida.

Content created February, 2003. Last update 2005-03-29 16:29 GMT