This document contains examples in another language or script.

An Introduction to Writing Systems:
A review of script characteristics affecting computer-based script support and Unicode

Front matter

Intended audience

Anyone who wants to better understand how scripts work in computerised environments, and more particularly with regard to Unicode. The material should be accessible for a wide audience, from software engineers to managers.

While the tutorial is perfectly accessible to beginners, it has also attracted very good reviews from people at an intermediate and advanced level, due to the breadth of scripts discussed. No previous knowledge is assumed.

Why should you read this?

When planning to introduce products into new markets it is important to understand the impact of having to support different scripts. The tutorial will make clear that this is not usually a trivial issue, and if you need to implement support, it may involve decisions at a very early stage in the design process.

This tutorial is particularly useful for people who are new to Unicode, in that it provides an overview of the basics in the context of real examples.

Objectives

The tutorial will provide you with an understanding of key requirements for implementing writing systems in information technology. It will do this by examining real examples of a wide range of modern scripts to discover features that a computerized implementation must support. It will also make special reference, where appropriate, to how the Unicode Standard points the way forward for meeting these requirements.

The tutorial does not provide detailed coding advice, but does provide the essential background information you need to understand the fundamental issues. It will also constitute an excellent orientation for newcomers to the topic, providing a wide-ranging framework that assists in assimilating further, more detailed and specific information.

Naturally, given the tutorial format this is an ambitious approach, and it will mean that we cannot go into great detail on any particular topic. If you would like to understand a topic better, there are a couple of excellent resources cited at the end of the tutorial, one of which is the very readable Unicode Standard itself.

How to use this material

This material is organized around a set of presentation slides which can be viewed in several ways. Each view is identified by an icon as described below.

All in one: A single page containing all explanatory text followed by small accompanying slides.

Slide by slide: One page per slide. This is particularly useful if you need to see the detail on a slide.

Slide text: This page-by-page version of the slides is provided mainly for those who want to cut and paste the text on the slides. (You will need appropriate fonts and rendering software to see the text correctly.)

Please send any comments to ishida@w3.org.

Scripts addressed and Reference examples

We will organize the material in the tutorial by concept, rather than by script. To help you, the script or scripts to which the concept applies will always be listed at the top right of the slide.

The list of scripts includes:

The tutorial covers most of the key features of each of these scripts.

There is a set of web pages with sample text in each of the scripts we will address. Each of the sample pages is a translation of the same English text. We will use these samples to illustrate as many of the points made as possible. That way you will be able to experiment with the examples yourself. In fact, where I have taken an example from a sample page I have typically included the text of that sample on the slide to help you locate real instances more easily.

If you use these examples for your own material, please ensure that you cite this paper and the web site as a source reference.

Large character sets

CJK character sets

Chinese

Initially there was only one type of Chinese – what we now call Traditional Chinese. Then in the 1950s Mainland China introduced a Simplified Chinese. It was simplified in two ways:

  1. the more common character shapes were reduced in complexity,

  2. a relatively smaller set of characters was defined for common usage than had traditionally been the case (resulting in the mapping of more than one character in Traditional Chinese to a single character in the Simplified Chinese set).

This slide shows Traditional Chinese above and Simplified Chinese below.

Traditional Chinese is still used to write characters in Taiwan and Hong Kong, and much of the Chinese diaspora. Simplified Chinese is used in Mainland China and Singapore. It is important to stress that people speaking many different, often mutually unintelligible, Chinese dialects would use one or other of these scripts to write Chinese – ie. the characters do not necessarily represent the sounds.

There are a few local characters, such as for Cantonese in Hong Kong, that are not in widespread use. In Chinese these ideographs are called hanzi. They are often referred to as Han characters.

There is another script used with Traditional Chinese for annotations and transliteration during input. It is called zhuyin or bopomofo, and will be described in more detail later.

It is said that Chinese people typically use around 3-4,000 characters for most communication, but a reasonable word processor would need to support at least 10,000. Unicode supports over 70,000 Han characters.

This slide shows examples of contrasting shapes in Traditional and Simplified ideographs.

The characters on the left are one ideograph; the characters on the right are another. Characters at the top are Traditional shapes; characters at the bottom are Simplified.

Note that each of the large glyphs shown above is a separate code point in Unicode. The Simplified and Traditional shapes are not unified unless they are extremely similar. (Han unification will be explained in more detail later.)
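
As a minimal sketch of what separate code points mean in practice, the Python snippet below compares a Traditional and a Simplified form. The pair chosen (the character for 'language') is an illustrative assumption and is not necessarily the pair shown on the slide.

```python
# Illustrative pair (an assumption, not taken from the slide): the Traditional
# and Simplified forms of the character meaning "language".
traditional = "\u8a9e"  # Traditional form
simplified = "\u8bed"   # Simplified form

for label, ch in (("Traditional", traditional), ("Simplified", simplified)):
    print(f"{label}: {ch} U+{ord(ch):04X}")

# The two shapes are distinct code points, so a plain string comparison
# treats them as different characters.
print(traditional == simplified)
```

Software that needs to match text across the two forms therefore cannot rely on simple code point comparison and needs some kind of mapping data.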

Japanese

Japanese uses three native scripts in addition to Latin (called romaji), and mixes them all together.

Top right on the slide is an example of ideographic characters, borrowed from Chinese, which in Japanese are called kanji. Kanji characters are used principally for the roots of words.

The example at the top left of the slide is written entirely in hiragana. Hiragana is a native Japanese syllabic script typically used for many indigenous Japanese words (as in this case) and for grammatical particles and endings. The example at the bottom of the slide shows its use to express grammatical information alongside a kanji character (the darker, initial character) that expresses the root meaning of the word.

Japanese everyday usage requires around 2,000 kanji characters – although Japanese character sets include many thousands more.

The example at the bottom of this slide shows the katakana script. This is used for foreign loan words in Japanese. The example reads ‘te-ki-su-to’, ie. ‘text’.

On this slide we see the more common characters from the hiragana (left) and katakana (right) syllabaries arranged in traditional order. A character in the same location in each table is pronounced exactly the same.

With the exception of the vowels on the top line and the letter ‘n’, all of the symbols represent a consonant followed by a vowel.

Voiced consonants are indicated by attaching a dakuten mark to the unvoiced shape. The ‘p’ sound is indicated by the use of a han-dakuten (compare glyphs for ‘ha’, ‘ba’, and ‘pa’).

A small ‘tsu’ (っ) is commonly used to lengthen a consonant sound.

Small versions of や, ゆ, and よ are used to form syllables such as ‘kya’ (きゃ), ‘kyu’ (きゅ), and ‘kyo’ (きょ) respectively.

When writing katakana the mark ー is used to indicate a lengthened vowel.
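
The relationship between 'ha', 'ba' and 'pa' described above is also visible in the Unicode data. The following sketch uses Python's standard unicodedata module to decompose the voiced and semi-voiced hiragana into the unvoiced shape plus a combining mark. It is only an illustration; the helper name decompose is hypothetical.

```python
import unicodedata

ha, ba, pa = "\u306f", "\u3070", "\u3071"  # hiragana ha, ba, pa


def decompose(ch):
    """Return the canonical decomposition (NFD) as a list of U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]


for ch in (ha, ba, pa):
    # 'ba' and 'pa' decompose to 'ha' followed by the combining dakuten
    # (U+3099) or the combining han-dakuten (U+309A).
    print(ch, decompose(ch))
```

In practice Japanese text is normally stored using the precomposed characters, and recomposing with NFC yields them again.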

The example at the top of the slide shows the small tsu being used in katakana to lengthen the ‘t’ sound that follows it. This can be transcribed as ‘intanetto’.

The bottom example shows usage of other small versions of katakana characters. The transcription is ‘konpyuutingu’. In the first case the small ‘yu’ combines with the preceding ‘pi’ to produce ‘pyu’. In the second case the small ‘i’ is used with the preceding ‘te’ syllable to produce ‘ti’ – a sound that is not native to Japanese. (Their equivalent would be ‘chi’.)

The bottom example also shows the use of the han-dakuten and dakuten to turn ‘hi’ into ‘pi’ and ‘ku’ into ‘gu’.

There is also a lengthening mark that lengthens the ‘u’ sound before it.

Korean

Korean uses a unique script called hangul. It is unique in that, although it is a syllabic script, the individual phonemes within a syllable are represented by individual shapes. The example shows how the word ‘ta-kuk-o’ is composed of 7 jamos, each expressing a single phoneme. The jamos are displayed as part of a two-dimensional syllabic character.

Note that the initial jamo in the last syllable is not pronounced in initial position and serves purely to conform to the rule that hangul syllables always begin with a consonant.

Unicode allows hangul text to be stored either as a sequence of jamos or as precomposed syllabic characters, although the latter is more common.
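
The two storage forms can be converted into one another with Unicode normalization. The sketch below assumes that the word on the slide is spelled with the three syllables shown in the code (an assumption based on the transcription ‘ta-kuk-o’) and uses Python's standard unicodedata module.

```python
import unicodedata

# Assumed spelling of 'ta-kuk-o' as three precomposed hangul syllables.
word = "\ub2e4\uad6d\uc5b4"

# NFD splits each syllable into its conjoining jamos;
# NFC recombines the jamos into precomposed syllables.
jamos = unicodedata.normalize("NFD", word)
syllables = unicodedata.normalize("NFC", jamos)

print(len(word), "syllables ->", len(jamos), "jamos")
for j in jamos:
    print(f"U+{ord(j):04X}", unicodedata.name(j))
```

Because both forms represent the same text, comparison and searching generally need to normalize to one form first.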

South Korea also mixes ideographic characters borrowed from Chinese with hangul, though on nothing like the scale of Japanese. In fact, it is quite normal to find whole documents without any hanja, as the ideographic characters in Korean are called.

There are about 2,300 hangul characters in everyday use, but the Unicode Standard has code points for around 11,000.

Visual characteristics

Note how, because all the characters above are monospaced and fit within the same-sized box, the text on the slide gives the appearance of a grid. Grid layouts are actually a common typographic convention in East Asian scripts.

When half-width or proportionally-spaced characters are introduced, there is a possibility of this grid being corrupted, but typographic devices are available to provide several possible solutions to this.

You can experiment with various types of grid setting using CSS on the following web pages:

Han and kana characters are usually full-width, whereas Latin text is half-width or proportionally spaced.

Half-width katakana characters do exist, and for compatibility reasons there is a Unicode block for half-width kana characters. These codes should not normally be used, however. They arise from the early computing days when Japanese had to be fitted into a Western-biased technology.

Similarly, it is common to find full-width Latin text, especially in tables. Again, there is a Unicode block dedicated to full-width Latin characters and punctuation, but the effect should normally be achieved with a font instead.
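
When legacy data does contain these compatibility characters, one possible way to fold them back to the ordinary characters is compatibility normalization (NFKC). The sketch below uses Python's standard unicodedata module; the helper names widths and fold_width are hypothetical. Note that NFKC also changes other compatibility characters, so it is a blunt tool and may not suit every application.

```python
import unicodedata

full_width_latin = "\uff21\uff22\uff23"            # full-width A, B, C
half_width_kana = "\uff83\uff77\uff7d\uff84"       # half-width te, ki, su, to


def widths(text):
    """Return the East Asian Width property of each character."""
    return [unicodedata.east_asian_width(c) for c in text]


def fold_width(text):
    """Replace compatibility width variants with the ordinary characters."""
    return unicodedata.normalize("NFKC", text)


for sample in (full_width_latin, half_width_kana):
    print(sample, widths(sample), "->", fold_width(sample))
```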

Radicals

A radical is an ideograph or a component of an ideograph that is used for indexing dictionaries and word lists, and as the basis for creating new ideographs. The 214 radicals of the KangXi dictionary are universally recognised.

The examples enlarged on the slide show the ideographic character meaning ‘word’, ‘say’ or ‘speak’ (bottom left), and three more characters that use this as a radical on their left hand side.

The visual appearance of radicals may vary significantly.

Here the radical shown on the previous slide is seen as used in Simplified Chinese (top right). Although the shape differs somewhat it still represents the same radical.

On the bottom row we see the ‘water’ radical being used in two different positions in a character, and with two different shapes. This time the right-most example is found in both simplified and traditional forms.

Unicode dedicates two blocks to radicals. The KangXi radicals block depicted here contains the base forms of the 214 radicals.

The CJK Radicals Supplement contains variant shapes of these radicals when they are used as parts of other characters or in simplified form. These have not been unified because they often appear independently in dictionary indices.

Characters in these blocks should never be used as ideographs.
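
The difference between a radical character and the corresponding ideograph can be checked programmatically. This sketch uses the ‘speech’ radical from the KangXi radicals block and Python's standard unicodedata module. It shows that the two characters look alike but are different code points, related only through a compatibility mapping.

```python
import unicodedata

radical = "\u2f94"    # character from the KangXi radicals block
ideograph = "\u8a00"  # the ordinary ideograph meaning 'word', 'say' or 'speak'

print(unicodedata.name(radical))
print(unicodedata.name(ideograph))

# The characters are not equal as stored...
print(radical == ideograph)

# ...but compatibility normalization maps the radical to the ideograph.
print(unicodedata.normalize("NFKC", radical) == ideograph)
```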

Implementing multi-byte characters

In the early days of computing a byte consisted of 7 bits, allowing for a code page containing 128 code points. This was the day of ASCII.

When bytes contained 8 bits they gave rise to code pages containing 256 code points. These code pages typically retain the ASCII characters in the lower 128 code points and add characters for additional languages to the upper reaches. On the slide we see a ‘Latin1’ code page, ISO 8859-1, containing code points for Western European languages.

Unfortunately, 256 code points was not enough even to support the whole of Europe – not even Latin-based languages such as Turkish, Hungarian, etc. To support Greek characters you might see the code points re-mapped as shown on the slide (left hand side). These alternative code pages forced you to maintain contextual information so that you could determine the intended character from the upper ranges of the code page. It also made localization difficult since you had to keep changing code pages.
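
The need for contextual information can be seen by decoding one and the same byte with two different code pages. The sketch below uses Python's built-in ISO 8859-1 (Latin1) and ISO 8859-7 (Greek) codecs; the byte value chosen is just an example from the upper half of the code page.

```python
raw = bytes([0xE1])  # one byte from the upper range of an 8-bit code page

as_latin1 = raw.decode("iso8859_1")  # Western European interpretation
as_greek = raw.decode("iso8859_7")   # Greek interpretation

# Without knowing which code page applies, the byte alone is ambiguous.
print(f"0xE1 as Latin1: {as_latin1} (U+{ord(as_latin1):04X})")
print(f"0xE1 as Greek:  {as_greek} (U+{ord(as_greek):04X})")
```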

East Asian computing immediately faced a much bigger problem than in Europe, as can be seen by the size of these common character sets. They resorted to double-byte coded character sets. Two-byte character sets provided 16 bits, and would allow for 2^16 (ie. 65,536) possible code points. In reality these character sets tended to be based on a 7-bit model, utilizing only a part of the total space available.

One particular problem persisted here – these character sets and their encodings were script specific. It was still difficult to represent Chinese, Korean and Japanese text simultaneously.

Unicode and ISO 10646 from the start enabled the use of the full range of 16-bit code points for their Basic Multilingual Plane (BMP). This meant that all of the above scripts and more could be represented simultaneously with ease. Localization also became easier, since there was no need to enable new code pages or switch encodings – you simply began using the characters in the appropriate part of the BMP.

Recently Unicode and ISO 10646 have defined 16 supplementary planes, each the same size as the BMP, for future expansion. Some of those planes are being populated already. There are code points defined for additional alphabets and a large number of math characters in the Supplementary Multilingual Plane (SMP). Also a large number of additional ideographic characters have been added to the Supplementary Ideographic Plane (SIP).

In total there are now over one million code points available.
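
The arithmetic behind that figure is straightforward: the BMP plus 16 supplementary planes, each of 2^16 code points. The sketch below works it out and shows a hypothetical helper, plane_of, that finds the plane of a character by dividing its code point by the plane size.

```python
PLANE_SIZE = 2 ** 16          # code points per plane
PLANE_COUNT = 1 + 16          # the BMP plus 16 supplementary planes
TOTAL_CODE_POINTS = PLANE_SIZE * PLANE_COUNT

print(f"{TOTAL_CODE_POINTS:,} code points in total")


def plane_of(ch):
    """Return the plane number (0 = BMP, 1 = SMP, 2 = SIP, ...) of a character."""
    return ord(ch) // PLANE_SIZE


samples = {
    "Latin letter": "A",
    "math letter": "\U0001d400",   # a mathematical alphanumeric symbol
    "ideograph": "\U00020000",     # an ideograph outside the BMP
}
for label, ch in samples.items():
    print(label, f"U+{ord(ch):04X}", "plane", plane_of(ch))
```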

Unification

Unicode provides a superset of most character sets in use around the world, but tries not to duplicate characters unnecessarily. For example, there are several ISO character sets in the 8859 range that all duplicate the ASCII characters. Unicode doesn't have as many codes for the letter 'a' as there are character sets - that would make for a huge and confusing character set.

The same principle applies for Han (Chinese) characters. The initial set of sources for Han encoding in Unicode laid end to end comprised 121,000 characters, but there were many repeats, and the final Unicode tally for all these after elimination of duplicates was 20,902. (There are now over 70,000 Han characters encoded in Unicode.)

If Han characters had different meanings or etymologies, they were not unified. Han characters, however, are highly pictorial in nature, so the (dis-)unification process had to take into account the visual forms to some extent. Where there was a significant visual difference between Han characters that represented the same thing they were allotted to separate Unicode code points. (Unifying the Han characters is a sophisticated process, carried out over a long period by many East Asian experts.)

Factors such as those shown on this slide prevent unification, ie.

What is left for unification are characters representing the same thing but exhibiting no visual differences, or relatively minor differences such as different sequence for writing strokes, differences in stroke overshoot and protrusion, differences in contact and bend of strokes, differences in accent and termination of strokes, etc.

Encoding methods

Although the terms character set and character encoding are often treated as the same thing, we will use them to mean separate things in this tutorial.

A character set or repertoire comprises the set of atomic text elements you will use for a particular purpose – be it the repertoire of characters required to support Western European languages in computers, or the repertoire of characters a Chinese child will learn at school this year (nothing to do with computers).

The character encoding reflects the way these abstract characters are mapped to numbers for manipulation in a computer.

In a standard such as ISO 8859 encodings tend to use a single byte for a given character and the encoding is straightforwardly related to the position of the characters in the set.

The above distinction becomes helpful when discussing Unicode because the set of characters (ie. the character set) defined by the Unicode Standard can be encoded in a number of different ways. The type of encoding doesn’t change the number or nature of the characters in the Unicode set, just the way they are mapped into numbers for manipulation by the computer (see the next slide).

Similarly, on the Web, the document character set of an XML or HTML document is always Unicode. A particular XML or HTML document, however, can be encoded using any encoding, even encodings that don’t cover the full Unicode range such as ISO 8859-1 (Latin1). However, because the document character set is Unicode, even if a Web page uses Latin1 as its encoding, it can use special constructs called numeric character references (eg. &#x1234;) to include any Unicode character outside that encoding.
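
As a rough illustration of how numeric character references work, the sketch below uses Python's standard html module to resolve a reference to a character, and the built-in xmlcharrefreplace error handler to produce references when encoding text as Latin1. The sample string is an arbitrary example.

```python
import html

# Resolving a hexadecimal numeric character reference to the character it names.
reference = "&#x1234;"
character = html.unescape(reference)
print(f"{reference} -> U+{ord(character):04X}")

# Encoding text as Latin1: characters the encoding cannot represent are
# written out as (decimal) numeric character references instead.
text = "caf\u00e9 \u1234"
encoded = text.encode("iso8859_1", errors="xmlcharrefreplace")
print(encoded)
```

The character set of the document stays the same in both directions; only the way the character is written into the byte stream changes.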

Character encodings are the things that have names in the IANA registry.

The Unicode Standard assigns a unique scalar number to every character in its character set. The resulting numbered set is referred to as a coded character set. Units of a coded character set are known as code points.

In Unicode there are a number of ways of encoding the same character. These include UTF-8, UTF-16, and UTF-32.

UTF-8 uses 1 byte to represent characters in the old ASCII set, two bytes for characters in several more alphabetic blocks, and three bytes for the rest of the BMP. Supplementary characters use 4 bytes.

UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes for supplementary characters.

UTF-32 uses 4 bytes everywhere. In the chart on the slide, the first line of numbers represents the position of the characters in the Unicode coded character set. The other lines show the byte values used to represent that character in a particular character encoding.
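
The byte counts described above can be checked directly. This sketch encodes one sample character from each range using Python's built-in codecs; the big-endian variants are used so that no byte order mark is added to the count. The sample characters are arbitrary choices, not necessarily those in the chart on the slide.

```python
samples = {
    "ASCII letter": "a",
    "Hebrew letter": "\u05d0",
    "BMP ideograph": "\u4e2d",
    "supplementary ideograph": "\U00020000",
}


def byte_lengths(ch):
    """Return the number of bytes needed for ch in each encoding form."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


for label, ch in samples.items():
    print(f"{label:24} U+{ord(ch):04X} {byte_lengths(ch)}")
```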

This explanation glosses over some of the detailed nomenclature related to encoding. More detail can be found in Unicode Technical Report #17.

Respecting character boundaries

You can see from the previous slide that nearly all Unicode characters are encoded using multiple bytes. In an encoding such as UTF-8 the number of bytes actually used depends on the character in question. This means that care has to be taken to recognize and respect the integrity of the character boundaries. Applications cannot simply handle a fixed number of bytes when performing editing operations such as inserting, deleting, wrapping, cursor positioning, etc. Collation for searching and sorting, pointing into strings, and all other operations similarly need to work out where the boundaries of the characters lie in order to successfully process the text.

An ASCII character such as 'a' will be represented by a single byte.

Hebrew characters are represented in UTF-8 using two bytes.

Ideographic characters in the Basic Multilingual Plane are three bytes in UTF-8. The moral of this story is, don’t use bytes, use characters.

This slide illustrates how things go wrong with old technology that is not multi-byte aware. In this case the author attempted to delete a Chinese character, and the application translated that to mean delete a single byte. This causes a misalignment of all the following bytes, and produces garbage.
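
The same kind of failure can be reproduced with UTF-8. The sketch below deletes a single byte from an encoded three-character Chinese string and then tries to decode the result. The string is an arbitrary example, and the encoding on the slide is probably an older one. UTF-8 happens to be designed so that a decoder can find the next character boundary again, so the damage here stays local, but the partly deleted character is still lost.

```python
text = "\u4e2d\u6587\u5b57"     # three Chinese characters
data = text.encode("utf-8")     # 3 characters, 9 bytes
print(len(text), "characters,", len(data), "bytes")

# A byte-oriented "delete" removes one byte, not one character.
broken = data[1:]

try:
    broken.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False           # the remaining bytes are not valid UTF-8
print("decodes cleanly:", strict_ok)

# Decoding leniently shows replacement characters where the damage is.
garbled = broken.decode("utf-8", errors="replace")
print(garbled)

# Deleting in terms of characters gives the intended result.
fixed = text[1:]
print(fixed)
```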

Inputting ideographic characters

Getting to the right character quickly

We have noted that East Asian character sets number their characters in the thousands. So how do you quickly find the one character you want while typing?

In the past people have tried using extremely large keyboards, or forcing people to remember the code point numbers for the character. Not surprisingly these approaches were not very popular.

The answer is to use an IME (Input Method Editor). An IME (also called a front-end processor) is software that uses a number of strategies to help you search for the character you want.

This slide summarizes the typical steps when typing in Japanese using a standard IME for Windows.

The user types Japanese in romaji transcription using a QWERTY keyboard. As they type the transcription is automatically converted to hiragana or katakana. Ranges of characters are accepted by a key press as they go along. To convert a range of characters to kanji, the user presses a key such as the space bar. Typically the IME will automatically insert into the text the kanji that were last selected for the transcription that has been input. If this is not the desired kanji sequence, the user presses the key again and a selection list pops up, usually ordered in terms of frequency of selection. The user picks the kanji characters required, and confirms their choice, then moves on.

Note that there are only a few alternatives for the sequence かいぎ. If the user had looked up かい and ぎ separately they would have been faced each time with a large number of choices. The provision of a dictionary as part of the IME for lookup of longer phrases is one way of speeding up the process of text entry for the user.

Ordering by frequency and memory of the last conversion are additional methods of assisting the user to find the right character more quickly.
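
The strategies just described (phrase dictionary, ordering by frequency, and memory of the last conversion) can be sketched in a few lines. Everything in the following example is hypothetical: the class ToyIME, the tiny dictionary and the frequency numbers are invented purely for illustration and do not reflect how any real IME is implemented.

```python
class ToyIME:
    """A toy kana-to-kanji lookup; not a model of any real IME."""

    def __init__(self, dictionary):
        # reading -> list of (candidate, frequency) pairs
        self.dictionary = dictionary
        self.last_choice = {}

    def candidates(self, reading):
        """Return candidates, most frequent first, last selection on top."""
        ranked = sorted(self.dictionary.get(reading, []), key=lambda pair: -pair[1])
        words = [word for word, _ in ranked]
        last = self.last_choice.get(reading)
        if last in words:
            words.remove(last)
            words.insert(0, last)
        return words

    def select(self, reading, word):
        """Remember the user's choice for the next conversion."""
        self.last_choice[reading] = word


# Invented sample data: the frequencies are made up.
TOY_DICTIONARY = {
    "かいぎ": [("会議", 90), ("懐疑", 10)],
    "かい": [("会", 50), ("回", 40), ("階", 30), ("貝", 20), ("海", 10)],
}

ime = ToyIME(TOY_DICTIONARY)
first = ime.candidates("かいぎ")
ime.select("かいぎ", "懐疑")
second = ime.candidates("かいぎ")
print(first, second)
print(ime.candidates("かい"))
```

Looking up the longer reading gives a shorter candidate list than looking up its parts, which is the point made above about dictionary-based lookup.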

Chinese input methods

Whereas the romaji input method predominates for Japanese, there are a number of different approaches available for Chinese.

Pinyin was introduced with Simplified Chinese, and is typically used in the same geographical areas, ie. Mainland China and Singapore.

It is essentially equivalent to the romaji input method. The numbers you see in the example above indicate tones. This dramatically reduces the ambiguity of the sounds in Chinese.

One of the problems of pinyin is that the transcription is based on the Mandarin or Putonghua dialect of spoken Chinese. So to use this method you need to be able to speak that dialect.

A more common input method in Taiwan uses an alphabet called zhuyin or bopomofo. This alphabet is only used for phonetic transcription of Chinese. Essentially it is the same idea as pinyin, but with different letters. The tones in this case are indicated by spacing accent marks (shown only in the top line on the slide) which in Unicode are unified with accents used in European languages. An appropriate font will however display these as tone marks rather than accents in this context.

A very different approach allows the user to create the desired character on the basis of its visual appearance rather than the underlying phonics.

Changjie input uses just such an approach. The keyboard provides access to primitive ideographic components which, when combined in the right sequence, lead to the desired ideograph.

An advantage of an approach such as changjie is that you don’t have to speak Mandarin. A drawback is the additional training required.

Note that pen-based input is another useful approach. (In fact, this is particularly helpful for people who do not speak Chinese or Japanese. Once you master a few simple rules about stroke order and direction, you can use something like Microsoft’s IME Pad to draw and select characters without any knowledge of components or pronunciation.)

The examples on this slide show the keystrokes required to enter the text used in the previous slides containing pinyin and bopomofo examples.

Alternative representations of characters

In some cases you may come across an ideograph that your font or your character set doesn’t support. Unicode provides a way of saying, “I can’t represent it, but it looks like this character.”

The approach requires you to add character U+303E immediately followed by a similar looking character. This at least gives the reader a chance to guess at the character that is missing.

Another way of addressing the same problem is to use the ideographic description characters introduced in Unicode 3.0.

This approach allows you to draw a picture showing the various components of the character you can’t represent, and where they appear. The lower line on the slide shows how you would describe the large character near the top. Note that this is interpreted recursively.

Note also, that this should not be treated as in any way equivalent to an existing ideograph when collating strings.
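
The recursive interpretation can be sketched as a small parser. The example below assumes the twelve ideographic description characters U+2FF0 to U+2FFB, of which U+2FF2 and U+2FF3 take three operands and the rest take two. The function names and the sample sequence are hypothetical; the sample describes an ordinary, already encoded character simply to show the nesting, and is not the one on the slide.

```python
def arity(ch):
    """Number of operands an ideographic description character takes (0 if none)."""
    cp = ord(ch)
    if 0x2FF0 <= cp <= 0x2FFB:
        return 3 if cp in (0x2FF2, 0x2FF3) else 2
    return 0


def parse_ids(seq, pos=0):
    """Parse a prefix-notation description; return (tree, next position)."""
    ch = seq[pos]
    pos += 1
    count = arity(ch)
    if count == 0:
        return ch, pos                      # a plain component
    parts = []
    for _ in range(count):
        part, pos = parse_ids(seq, pos)     # operands may themselves be descriptions
        parts.append(part)
    return (ch,) + tuple(parts), pos


# left-to-right( U+8A00, above-to-below( U+4E94, U+53E3 ) )
ids = "\u2ff0\u8a00\u2ff1\u4e94\u53e3"
tree, end = parse_ids(ids)
print(tree)
```

A parser like this only recovers the structure of the description; as noted above, the result should not be treated as equivalent to any encoded ideograph.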

Summary

Complex script rendering

Definitions

Before getting into this section it is important to draw attention to the difference between characters and glyphs.

A character is a semantic unit representing an indivisible unit of text in memory.

A glyph is the visual representation of a character or sequence of characters.

The example on the slide shows two glyphs for the single character a, and two glyphs for a single Han character. This distinction will become very important in this section. For more information about the distinction between characters and glyphs, see Unicode Technical Report #17.

A font, by the way, is a collection of glyphs.

Combining characters

Arabic & Hebrew short vowels

Arabic and Hebrew scripts usually do not represent short vowel sounds. The languages are so heavily pattern based that readers can adequately guess at the pronunciation of the words.

In circumstances where ambiguity appears, such as the name of the German town Mainz in the example on the slide, short vowels are represented as diacritics attached to the base consonants.

Here, for example, the slide shows the Arabic word for engineer, pronounced ‘muhandis’.

It is actually written, ‘mhnds’.

If needed, the short vowels (there are only 3 in Arabic) are represented as shown on the lower line of the slide. Note that the small circle diacritic indicates NO intervening vowel. (Sequences of code points in Arabic and Hebrew on this and following slides will be shown in left to right order, to emphasise that the underlying order is logical.)

These short vowels are separate combining characters in the text stream that are displayed in the same two-dimensional visual space with the base character. Combining characters do not generally appear without a base character.
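
The sketch below shows the vowelled form of ‘muhandis’ as a sequence of code points in logical order, and strips the combining characters to recover the usual unvowelled spelling. The exact sequence is an assumption based on the transcription given above, and strip_marks is a hypothetical helper built on Python's standard unicodedata module.

```python
import unicodedata

# Assumed vowelled spelling: m + u, h + a, n + (no vowel), d + i, s
vowelled = "\u0645\u064f\u0647\u064e\u0646\u0652\u062f\u0650\u0633"

for ch in vowelled:
    kind = "combining" if unicodedata.combining(ch) else "base"
    print(f"U+{ord(ch):04X} {kind:9} {unicodedata.name(ch)}")


def strip_marks(text):
    """Drop every character that has a non-zero combining class."""
    return "".join(c for c in text if unicodedata.combining(c) == 0)


bare = strip_marks(vowelled)
print(len(vowelled), "code points with vowels,", len(bare), "without")
```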

Context-sensitive placement of diacritics

When displaying combining characters, care has to be given to appropriate positioning. In the Thai example on the slide, the same character code is used to represent both the tone mark glyphs that are circled. There are not two different characters based on the desired visual position. The font has to work out the best position for the glyph according to the run-time visual context.

This slide provides another example of context-sensitive positioning of combining characters.

The short vowel ‘i’ in Arabic is usually drawn below the base character. This is normally the only way of distinguishing it from the short vowel ‘a’, which is displayed above the base character.

In this example, however, an additional shadda diacritic is introduced. The shadda is used to lengthen the consonant it is attached to. In that context it is common (though not mandatory) for the ‘i’ vowel diacritic to appear above the base character, but below the shadda so you can still tell it apart from ‘a’.

Note also that this example introduces the idea that you can have more than one combining character associated with a base character.

Vowel signs

In Indic scripts and scripts derived from them a consonant character carries with it an inherent vowel. The character on the top line on the slide is transcribed ‘ka’, not just ‘k’.

If you want to follow the ‘k’ sound with a different vowel, you append a vowel sign to the consonant character. This vowel sign overrides the inherent vowel with a different sound.

In Indic scripts vowel signs are all combining characters. Unlike the Arabic and Hebrew short vowels, however, some of these combining characters may also take up additional space on a line (see the example ‘kii’ on the slide). They are referred to as spacing combining characters.
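
The distinction between spacing and non-spacing combining characters is recorded in the Unicode general category of each character. The sketch below, using Python's unicodedata module, compares two Devanagari vowel signs. The helper name mark_kind is hypothetical, and the particular vowel signs are my own choice of example.

```python
import unicodedata

KA = "\u0915"             # DEVANAGARI LETTER KA (carries the inherent vowel)
VOWEL_SIGN_II = "\u0940"  # takes up its own space on the line
VOWEL_SIGN_U = "\u0941"   # drawn below the consonant, no extra width


def mark_kind(ch):
    """Hypothetical helper: classify a character by its general category."""
    category = unicodedata.category(ch)
    if category == "Mc":
        return "spacing combining mark"
    if category == "Mn":
        return "non-spacing combining mark"
    return f"other ({category})"


for ch in (KA, VOWEL_SIGN_II, VOWEL_SIGN_U):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {mark_kind(ch)}")
```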

Thai, being derived from Indic scripts, also has vowel signs, although they are used in a slightly more complex way.

In the example on this slide, three vowel signs surround the consonant to produce the desired effect.

Whereas in the Indic scripts all vowel signs are combining characters, only one of the vowel signs in this example is combining. The other two (indicated by arrows) are normal spacing characters. This is a distinction introduced to Unicode at the request of the Thai national standards body.
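
You can check this distinction for yourself from the character properties. The following sketch uses Python's unicodedata module. The vowel signs chosen here are my own examples, not necessarily those on the slide, and is_combining is a hypothetical helper.

```python
import unicodedata

SARA_E = "\u0E40"   # written to the left of the consonant
SARA_A = "\u0E30"   # written to the right of the consonant
SARA_II = "\u0E35"  # written above the consonant


def is_combining(ch):
    """Hypothetical helper: true if the general category is a mark (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


for ch in (SARA_E, SARA_A, SARA_II):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} combining={is_combining(ch)}")
```

Only the vowel sign written above the consonant is a combining character. The other two are ordinary letters as far as the character properties are concerned.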

Coding combining characters

When it comes to implementing combining characters, an important question to ask is what order should be applied to them and the base character. Unless you have agreement on this, you can have serious problems when passing data between systems.

The Unicode Standard requires that all combining characters follow the base character they apply to in a Unicode string. (So the example to the left on the slide is correct.)

Each combining character has a combining class property expressed as a numeric value. Combining characters that appear in the same location relative to the base character when displayed will typically share the same combining class. For example acute, grave and circumflex accents all appear above the base character and all share the same combining class.

Multiple combining characters attached to the same base character do not have to be in any particular order unless the text is in one of the Unicode normalisation forms. The standard requires that two sequences of combining characters be treated as equivalent if they differ only in the relative order of marks that have different combining classes.

Unicode normalisation, however, applies a canonical ordering to multiple combining characters.

If characters have the same combining class they are likely to interact typographically to produce different possible results, as in the case above. In this case the ‘inside-out’ rule is applied. This rule states that the proximity of the combining character in the text stream must match the visual proximity.
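
The following sketch illustrates both cases using Python's unicodedata module. The base letter and accents are my own choice of example, not taken from the slide.

```python
import unicodedata

ACUTE = "\u0301"      # combining class 230 (above)
GRAVE = "\u0300"      # combining class 230 (above)
DOT_BELOW = "\u0323"  # combining class 220 (below)

for mark in (ACUTE, GRAVE, DOT_BELOW):
    print(unicodedata.name(mark), unicodedata.combining(mark))

# Different combining classes: the order in the text stream is not significant,
# and normalisation puts the marks into a single canonical order.
above_then_below = "a" + ACUTE + DOT_BELOW
below_then_above = "a" + DOT_BELOW + ACUTE
nfd_1 = unicodedata.normalize("NFD", above_then_below)
nfd_2 = unicodedata.normalize("NFD", below_then_above)

# Same combining class: the order is significant (inside-out rule),
# so normalisation leaves each sequence as it is.
acute_grave = "a" + ACUTE + GRAVE
grave_acute = "a" + GRAVE + ACUTE
nfd_3 = unicodedata.normalize("NFD", acute_grave)
nfd_4 = unicodedata.normalize("NFD", grave_acute)
```

After normalisation the first pair of strings is identical, while the second pair remains distinct.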

Precomposed vs. decomposed

There are many precomposed characters in Unicode that have an accent or diacritic already combined with a base character (such as a-acute above). It is however also possible to represent this character using a simple ‘a’ followed by a combining acute accent. This is referred to as a decomposed character sequence.

The Unicode Standard states that both of these approaches must be considered canonically equivalent.
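
In practice, a naive code point comparison does not honour this equivalence, so software has to do some work. A minimal sketch in Python follows. The function name canonically_equal is hypothetical, and comparing NFD forms is just one simple way to test for canonical equivalence.

```python
import unicodedata

precomposed = "\u00E1"  # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"  # 'a' followed by COMBINING ACUTE ACCENT


def canonically_equal(left, right):
    """Hypothetical helper: compare two strings after normalizing both to NFD."""
    return unicodedata.normalize("NFD", left) == unicodedata.normalize("NFD", right)


print(precomposed == decomposed)                   # raw code point comparison
print(canonically_equal(precomposed, decomposed))  # comparison after normalization
```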

Normalization

To facilitate string comparison in operations such as searching and sorting, it is helpful to adopt a standard policy with regard to precomposed versus decomposed variants of a character sequence, and the order in which multiple combining characters appear. This can be achieved by applying an appropriate normalization form. The Unicode Standard provides a normalization form called NFD that represents all character sequences in maximally decomposed form. In addition to decomposition, NFD applies a standard order to multiple combining characters attached to a base character. As an alternative, the Unicode Standard offers NFC. NFC is achieved by applying NFD to the text, then re-composing characters for which precomposed forms exist in version 3.0 of the standard.

Note that there are actually some precomposed forms in the Unicode character set that are not generated by NFC, for reasons we will not go into here. In addition, where there is no precomposed form, a character sequence is left decomposed, but canonical ordering is still applied to all combining characters.

The Unicode Standard also offers two more normalization forms, NFKD and NFKC, where K stands for ‘kompatibility’. These forms are provided because the Unicode character set includes many characters merely to provide round-trip compatibility with other character sets. Such characters represent such things as glyph variants, shaped forms, alternative compositions, and so on, but can be represented by other ‘canonical’ versions of the character or characters already in Unicode. Ideally, such compatibility variants should not be used. The NFKD and NFKC normalization forms replace them with more appropriate characters or character sequences. (This, of course, can cause a problem if you intend to convert data back into its original encoding, because you lose the original information.)
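
The sketch below, using Python's unicodedata module, shows two of the points made above: a compatibility character that is left alone by NFC but replaced by NFKC, and a precomposed character that NFC does not generate. The specific characters are my own choice of examples.

```python
import unicodedata

# A compatibility character: the Latin 'fi' ligature.
FI_LIGATURE = "\uFB01"
nfc_fi = unicodedata.normalize("NFC", FI_LIGATURE)
nfkc_fi = unicodedata.normalize("NFKC", FI_LIGATURE)

# A precomposed character that NFC does not produce: DEVANAGARI LETTER QA.
# Even in NFC it is represented as KA followed by the combining NUKTA sign.
QA = "\u0958"
nfc_qa = unicodedata.normalize("NFC", QA)

print([f"U+{ord(c):04X}" for c in nfc_fi])
print([f"U+{ord(c):04X}" for c in nfkc_fi])
print([f"U+{ord(c):04X}" for c in nfc_qa])
```

Note that the conversion of the ligature to 'f' followed by 'i' cannot be undone from the normalized text alone, which is the round-trip problem mentioned above.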

Context-sensitive glyph shaping

Word final glyph variants

In Hebrew and Greek there are certain characters (only a small number) that look different in the middle of a word and at the end. Two examples are shown on the slide. In each example, the same consonant appears in the middle of a word and at the end of a word in the sample text, and has a different appearance.

Due to traditional approaches, these shapes are encoded separately and are typed in using distinct keys on the keyboard. This is manageable because there are so few such characters.

In other scripts a very different approach has to be taken.

Cursive script

Arabic is often referred to as a cursive script, meaning that letters in a word are usually joined to each other – whether handwritten or printed.

The slide shows the unjoined form of the letter AIN at the top right, and, at the bottom, three joined examples of the same letter. As you can see, the shape changes quite dramatically.

This slide shows some more examples of un-joined Arabic letters (right column) and their various joining forms (to the left).

It is important to understand that there is only ONE code point here for each letter. The various different visual forms are only font-based glyphs chosen to suit the run-time visual context.

(There are compatibility characters encoded in Unicode for specific joining forms, but these should not be used for storing Arabic text edited in Unicode. They are only provided to allow round-trip conversions between Unicode and legacy character encodings. In text normalized using the compatibility forms NFKC or NFKD these are all mapped to the main Unicode Arabic block.)

The shapes on the slide can be referred to (from right to left) as independent, initial, medial and final.
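
To make the relationship between the single code point and the compatibility characters concrete, here is a small sketch using Python's unicodedata module. It takes the letter AIN as the example and shows that compatibility normalization maps each encoded joining form back to the one regular letter.

```python
import unicodedata

AIN = "\u0639"  # the one code point you should store for this letter

# Compatibility characters for the four joining forms of AIN.
presentation_forms = {
    "independent": "\uFEC9",
    "final": "\uFECA",
    "initial": "\uFECB",
    "medial": "\uFECC",
}

for shape, ch in presentation_forms.items():
    mapped = unicodedata.normalize("NFKC", ch)
    print(f"{shape}: U+{ord(ch):04X} {unicodedata.name(ch)} -> U+{ord(mapped):04X}")

all_map_to_ain = all(
    unicodedata.normalize("NFKC", ch) == AIN for ch in presentation_forms.values()
)
# NFC (canonical normalization only) leaves the compatibility characters alone.
unchanged_by_nfc = all(
    unicodedata.normalize("NFC", ch) == ch for ch in presentation_forms.values()
)
```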

Inputting cursive glyphs

On previous slides I mentioned the ‘run-time’ context. This is quite important. If I type in the Arabic letter HEH shown at the top of the slide it will initially be in an independent glyph form. If I press exactly the same key on the keyboard and insert exactly the same character alongside it in memory, however, the original letter HEH will be expected to join with the second HEH. The shape of the first HEH will therefore change to ‘initial’, and the second HEH will be in ‘final’ shape. Type another HEH and the second will become ‘medial’, and so on.

In this way Arabic text is constantly changing as you type. The editing application also has to adapt these glyphs as you do things such as backspace, insert or delete text.
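
The HEH example can be sketched as a few lines of Python. This is a deliberately simplified illustration, not a shaping engine: the helper joining_forms is hypothetical, and it assumes that every letter in the run joins on both sides, as HEH does. Real Arabic shaping also has to deal with letters that join on one side only, and with non-joining characters.

```python
def joining_forms(run_length):
    """Hypothetical helper: form names for a run of letters that all join on both sides."""
    if run_length <= 0:
        return []
    if run_length == 1:
        return ["independent"]
    return ["initial"] + ["medial"] * (run_length - 2) + ["final"]


# Simulate typing HEH three times: the forms of earlier letters change
# each time a new letter is added to the run.
HEH = "\u0647"
typed = ""
for _ in range(3):
    typed += HEH
    print(len(typed), joining_forms(len(typed)))
```

The stored string is simply the same code point repeated. Only the list of glyph forms has to be recomputed after each edit.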

Conjunct consonants

When two Indic consonants appear together without any intervening vowel sound they may form a conjunct, ie. the consonant cluster is rendered as a composite shape. This composite shape may show a vertical or horizontal mixture of the base shapes. In some cases the original constituents of a conjunct may not be recognizable.

One approach that is very common is the use of a half-form to represent the initial consonant in the cluster. An example of this is shown on the bottom line of the slide.

It is important to bear in mind, once again, that this is all glyph magic. The individual consonants are all still represented using the regular code points in memory; it is only the visual appearance that changes. There are no special code points for half-form glyphs. The appropriate glyph is simply applied at display time according to the rendering rules of the script.

In actual fact, there is a vital ingredient to a conjunct form that we have not yet discussed. It is called a virama. The virama is often called ‘vowel killer’.

If you simply put two consonants side by side in Unicode, as in the top line on the slide, you will get two separate consonants displayed (with the assumption on the part of the reader that there is an inherent vowel between them).

It is only when you put a virama character between them that they combine to form a conjunct. So the conjunct glyph shown middle right actually represents three underlying characters.

The number of conjunct forms can vary from font to font. Some fonts will be capable of rendering more than others. So what happens if the font you are using doesn’t have a conjunct glyph for the combination you want to create?

In this case the virama is shown visually as a combining mark – see the last line on the slide. (In fact, in modern Tamil this is the default approach.)
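
As a sketch of the underlying characters, the following Python compares two consonants side by side with the same consonants separated by a virama. The Devanagari consonants KA and SSA are my own choice of example and may not be the ones on the slide.

```python
import unicodedata

KA = "\u0915"
SSA = "\u0937"
VIRAMA = "\u094D"

side_by_side = KA + SSA        # displayed as two separate consonants
conjunct = KA + VIRAMA + SSA   # a font may render this as a single conjunct glyph

for label, text in (("side by side", side_by_side), ("conjunct", conjunct)):
    print(label, [unicodedata.name(ch) for ch in text])

# The virama is itself a combining character with its own combining class.
virama_class = unicodedata.combining(VIRAMA)
```

Whether the three-character sequence appears as a conjunct, a half-form or a visible virama depends on the font and rendering engine, not on the stored text.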

More character to glyph rendering

Special joining forms

The concepts we have discussed so far in this section on combining characters and glyph shaping have shown that there is no one-to-one correspondence, as there usually is in English, between the characters in memory and the glyphs displayed on screen. Indeed, sometimes complex rules are needed to determine the displayed result.

We have seen some of the more basic transformation cases, but over the next few slides we will take a quick look at some additional possibilities. This is by no means intended to give you all the information you need to implement these scripts – merely expose you to some slightly more advanced behavior.

First we look at some font-dependent alternatives for joining Arabic glyphs. Arabic glyphs typically join along the baseline, but in some (typically more classical) fonts, specific pairings join above the baseline as shown in the top left example on the slide.

The use of half-forms in Indic scripts could also be seen as a kind of special joining form.

Positioning variation

Spacing combining characters to the left of the base consonant are common in Indic scripts. Here what is important to bear in mind is that the Unicode rule about combining characters following the base character still applies. It is only as part of the rendering process that the glyph for the combining character is made to appear to the left.

The example on this slide shows how the Hindi word for ‘Hindi’ would normally be displayed, but on the second line shows the order of the characters in memory.
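
The order in memory can be listed with a short Python sketch. I am assuming here that the word on the slide is spelled with the six code points shown below; treat the spelling as my assumption rather than a transcription of the slide.

```python
import unicodedata

# Assumed spelling of the Hindi word for 'Hindi', in logical (memory) order.
word = "\u0939\u093F\u0928\u094D\u0926\u0940"

names = [unicodedata.name(ch) for ch in word]
for index, name in enumerate(names):
    print(index, name)
```

The vowel sign I is stored at index 1, after the consonant HA at index 0, even though its glyph is displayed to the left of that consonant.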

The example text from the Thai sample shown on this slide illustrates the same effect in Thai. This word is pronounced very much like ‘program’, and the vowel sign at the far left is actually pronounced after the third character (ie. it is the ‘o’ sound after ‘pr’).

We have already seen, however, that vowel signs are not necessarily combining in Thai, so no reordering is actually needed in this case. The characters displayed are actually stored in the same order in memory.

This slide shows some additional examples of reordering during display.

The top example shows a Tamil combining character that appears on both sides of the base consonant when displayed.

The bottom example shows the Devanagari repha in a consonant cluster. The RA code that appears at the beginning of the cluster in memory is rendered as a diacritic above the vowel sign that completes the syllabic cluster.

Ligatures

Ligatures are very common. Essentially a ligature is a single glyph that represents more than one underlying character.

The example shown here is of a mandatory ligature in Arabic. A LAM character followed by an ALEF character must always be displayed as a single lam-alef glyph.
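
One way to see the two underlying characters and their logical order is to decompose the compatibility character that Unicode encodes for this ligature. The sketch below uses Python's unicodedata module. As with the joining forms, the compatibility character is used here only for inspection and is not something you should store in Arabic text.

```python
import unicodedata

LAM = "\u0644"
ALEF = "\u0627"
LAM_ALEF_LIGATURE = "\uFEFB"  # compatibility character for the isolated ligature

decomposed = unicodedata.normalize("NFKC", LAM_ALEF_LIGATURE)

print(unicodedata.name(LAM_ALEF_LIGATURE))
print([unicodedata.name(ch) for ch in decomposed])
```

The single ligature glyph corresponds to two characters in memory, with LAM stored first.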

The top line on this next slide shows another Arabic ligature. This ligature is optional and will only be displayed if the font developers included it. In other words, the number of ligatures available will generally vary with the font being used.

The second line shows that ligatures in Arabic also have joining forms when they occur alongside other characters.

This slide shows some ligatures used to render Indic consonant clusters.

Again, the number of ligatures available in a font varies. In some fonts the lower example may simply be rendered using a visible virama.

Ligatures are not only used for combining consonants. This slide shows the effect of combining a single vowel sign with various consonants in Tamil. As you can see, the combinations produce some complex and very varied results.

Joiner & non-joiner control characters

We have seen how Arabic glyphs join up with each other when juxtaposed. Unicode provides some special characters, invisible to the naked eye and to processing algorithms, to help control joining behaviour manually.

The zero-width non-joiner character (U+200C) can be inserted between the three characters LHM to create the effect on the second line. Here the three characters are not separated by spaces, but the glyphs no longer join.

The zero-width joiner character (U+200D), on the other hand, has the opposite effect. The three characters on the third line have spaces between them, but the joiner character is used to produce the joining forms of the glyphs. This behaviour is occasionally needed for correctly rendering Arabic text.

Unicode allows you to force a consonant + virama sequence to display the virama where the font would otherwise have used a half-form – add a zero width non-joiner immediately after the virama of the dead consonant.

Unicode allows you to force a dead consonant to assume a half-form rather than combine as part of a ligature – place a zero width joiner immediately after the virama.
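
These two sequences can be sketched in Python as follows. The consonants KA and SSA are my own choice of example, and strip_join_controls is a hypothetical helper showing one way an application might ignore the control characters when comparing text.

```python
import unicodedata

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER
KA, VIRAMA, SSA = "\u0915", "\u094D", "\u0937"

plain = KA + VIRAMA + SSA                    # the font chooses the rendering
explicit_virama = KA + VIRAMA + ZWNJ + SSA   # request a visible virama
half_form = KA + VIRAMA + ZWJ + SSA          # request a half-form


def strip_join_controls(text):
    """Hypothetical helper: remove ZWJ and ZWNJ before comparing strings."""
    return text.replace(ZWNJ, "").replace(ZWJ, "")


for ch in (ZWNJ, ZWJ):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} category={unicodedata.category(ch)}")
```

Both control characters have the general category Cf (format), which is how software can recognise that they carry no visible content of their own.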

Summary

Text direction

Vertical text

Text flow

Vertical Chinese, Japanese and Korean text is read from top to bottom, with lines progressing from the right to the left of the page. (All of these scripts can also be set horizontally.)

Vertically oriented text is still very common in printed matter such as books, magazines and newspapers.

Rotations & shifts

This and the two following slides show differences between the same text when set horizontally and vertically.

On this slide we see how parentheses and vowel lengthening marks are rotated. Bear in mind that this does not reflect any change in the underlying characters – the change is purely in the choice of glyph.

This slide shows how punctuation and small kana characters move from one corner of the cell square to the other. This is not a question of rotation.

This is not always an issue. In Chinese a period or a comma is typically centered in the character cell.

This slide shows the treatment of embedded Latin text. Text typically flows down the line at a 90 degree rotation from the East Asian characters. Acronyms and initials, however, are commonly not rotated.

Tate chu yoko

Vertical text often includes short runs of horizontal numbers or Latin text, called tate chu yoko. The example on this slide shows the heading of a newspaper article in Korean.

(Note also the use of a hanja character meaning ‘hundred’ between the 3 and the 76.)

Vertical columns

This slide illustrates the path of the eye in two-column vertical text. The columns, of course, run horizontally. If you are implementing an OCR application, this is an important thing to get right.

Bidirectional text

Right alignment

Arabic and Hebrew scripts run predominantly from right to left and are right aligned.

Bidirectional ordering

Text in right-to-left scripts does not always flow right to left. Embedded Latin text and all numbers are read left to right. It is for this reason that these scripts are referred to as bidirectional.

The slide illustrates the direction of the eye while reading the line containing a date.

Note that there may be slight differences between the Arabic and Hebrew approaches here. The Arabic for the dates ’10-12’ reads ’12-10’ – only the numbers flow left to right. In Hebrew it is likely that you would see ’10-12’ – the whole expression now runs left to right.

It is important to understand that the order of characters in memory is unidirectional, following what is called the logical order. The reordering you see on screen or paper is magic worked by the text rendering algorithms of the software.

The order in memory essentially follows the order in which the text is typed.

East Asian scripts often used to also be read from right to left, though this is much less common nowadays. The example shown on the slide is a Traditional Chinese newspaper where the text of the articles is predominantly vertical. The headings of the articles and the captions of the pictures run right to left. In this way the reading direction of the horizontal text is consistent with the flow of text in the surrounding articles.

Japanese text never does this any more. Even in newspapers containing vertically set text, the titles and captions always run left to right these days.

Unicode bidirectional algorithm

The Unicode Standard has a Bidirectional Algorithm that should be used to support the display of bidi text. Unlike many character sets, Unicode provides several properties of semantic information for every character. One of these properties indicates the behavior of the character with regard to inline directionality.

All Arabic and Hebrew letters have a directional type of right-to-left. Most other letters have a left-to-right type, including all numbers. Punctuation is typically directionally neutral, since its location depends on the context.
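
You can look up the directional type of any character with Python's unicodedata module, as in the sketch below. The sample characters are my own, and bidi_types is a hypothetical helper. Bear in mind that the property values are more fine-grained than the simple description above: for example, Arabic letters, European digits and separators such as the full stop each have their own class.

```python
import unicodedata


def bidi_types(text):
    """Hypothetical helper: pair each character with its Bidi_Class value."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


samples = {
    "Hebrew letter ALEF": "\u05D0",
    "Arabic letter ALEF": "\u0627",
    "Latin letter a": "a",
    "European digit 1": "1",
    "Arabic-Indic digit 1": "\u0661",
    "full stop": ".",
    "left parenthesis": "(",
    "space": " ",
}

for label, ch in samples.items():
    print(f"{label}: {unicodedata.bidirectional(ch)}")
```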

The next slide illustrates the use of this typing.

As each of the right-to-left typed characters is typed on the first and second lines the bidirectional algorithm will place the next character to the left of the previous one.

Because the numbers have a type of left-to-right, the 0 is automatically added to the right of the 1.

The punctuation relies on context to determine its position, so when it is initially entered the rendering algorithm assumes that it is a sentence-final period and part of the overall right-to-left flow. If a space and some more Hebrew characters were then input the period would remain there.

If, however, the next input character is a number, the rendering algorithm views the period as part of the left-to-right flow of the number, and moves it automatically to the right of the 10.

The bidirectional algorithm is somewhat more complicated than this in reality, but this helps you understand the basic idea.

Mirrored characters

The treatment of paired characters such as parentheses deserves a brief mention here. The text on the slide flows consistently from right to left. The point of interest is the shape of the parentheses.

The first parenthesis encountered while reading – looks like ')' – is actually a LEFT PARENTHESIS, ie. in left-to-right text it looks like '('. The bidirectional algorithm expects the visual shape of such mirrored characters to be swapped when used in a right-to-left context. (In Unicode 1.0 the LEFT PARENTHESIS was actually referred to as OPENING PARENTHESIS.)

This approach facilitates the matching of character sequences across scripts.
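
The mirroring behaviour is recorded as a character property, which can be queried with Python's unicodedata module as in this small sketch.

```python
import unicodedata

for ch in "()[]a":
    # mirrored() returns 1 if the glyph should be mirrored in right-to-left text.
    print(repr(ch), unicodedata.name(ch), unicodedata.mirrored(ch))

left_paren_is_mirrored = unicodedata.mirrored("(") == 1
letter_is_mirrored = unicodedata.mirrored("a") == 1
```

The character stored in memory is the same LEFT PARENTHESIS in both directions. Only the glyph chosen for display differs.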

Bidi formatting control characters

There are occasionally situations where the bidirectional algorithm needs a little help to determine the directional context.

Unicode provides special, invisible control codes to help clarify such ambiguities or intentional deviations from the rules of the bidirectional algorithm.

The sample sentence preceded by the asterisk on the slide shows what you get if you rely solely on the bidirectional algorithm, due to the inherent ambiguity of the phrase.

By applying an embedding control character as shown in the view of the logical order of characters at the bottom of the slide, the correct result can be obtained (the middle line).

NOTE: Although these codes should do the job, in marked up text such as HTML and XML you should not use these control codes but use available markup instead (eg. the dir attribute in HTML).
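
For plain text, where no markup is available, wrapping a phrase in embedding controls might look like the following Python sketch. The function embed_rtl is a hypothetical helper, and the sample phrase is my own rather than the sentence on the slide.

```python
import unicodedata

RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING: opens a right-to-left embedding
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING: closes the most recent embedding


def embed_rtl(phrase):
    """Hypothetical helper: wrap a phrase so it is treated as a right-to-left run."""
    return RLE + phrase + PDF


# Assumed sample: a Hebrew word followed by Latin text that belongs with it.
phrase = "\u05E9\u05DC\u05D5\u05DD W3C"
wrapped = embed_rtl(phrase)

for ch in (RLE, PDF):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} class={unicodedata.bidirectional(ch)}")
```

The controls are invisible when displayed but add two characters to the stored string, which is one reason to prefer markup where it exists.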

Visual selection

The difference between the logical, underlying order of bidirectional text and the displayed order of characters also has an impact on highlighting.

The next slide illustrates what may happen if you place your cursor at the point labeled “Start here” and extend your selection to the point marked “End here”.

As you can see here, two separate ranges of text have been highlighted, one of which falls outside the two points we mentioned on the previous slide. A look at the underlying codes (see the bottom of the slide) immediately reveals the inherent logic in this approach. Although this operation produced two visual selections, it produced only a single logical selection in memory.

It is also possible to find applications that would have produced a single visual highlight in this case. It is important to note, however, that this would represent two independent logical selections in memory.

Directional bias in layout & graphics

Screen layout

A predominant reading direction of right-to-left can have an impact on more than just the text. If you look at the Arabic and Hebrew sample pages in Internet Explorer you will see that the scroll bar appears on the left.

In Arabic and Hebrew environments the layout of screen information is typically mirrored to reflect the scanning direction of the text.

The screen shot of an editor on the slide shows the following differences from the English version:

In addition, the text on pull down menus is on the right and accelerator keys are listed to the left.

If there were submenus they would cascade to the left, not the right as in an English user interface.

All the items on this dialogue box are mirrored by comparison to the English version.

Graphics, icons and charts

In addition to layout of user interfaces, directionality may affect the layout of charts, tables, spreadsheets, collated pictures, and the like. (This appears to be more consistently the case for Arabic documents than for Hebrew.)

(It is worth visiting the Arabic or Hebrew sample page on the Web to see the effect of changing the directionality of the page on the table cells shown at the top of this slide. Just click on the button provided.)

Also, any graphics showing directional bias will need to be mirrored.
