Slide

The Unicode Standard assigns a unique scalar number to every character in its character set. The resulting numbered set is referred to as a coded character set. Units of a coded character set are known as code points.
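As a rough illustration of the idea of a code point, the following Python sketch looks up the scalar number assigned to a few characters and prints it in the conventional U+XXXX notation. The helper name `code_point_label` and the sample characters are chosen here for illustration and do not come from the slide.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    # ord() gives the character's number in the Unicode coded character set.
    return f"U+{ord(ch):04X}"


# Sample characters (hypothetical choices), written as escapes so the
# source file itself stays plain ASCII.
SAMPLES = ["\u0041", "\u00e9", "\u20ac", "\U0001D11E"]

for ch in SAMPLES:
    print(code_point_label(ch))
```

The first three samples lie in the BMP; the last one is a supplementary character, so its label needs five hexadecimal digits.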

In Unicode there are a number of ways of encoding the same character. These include UTF-8, UTF-16, and UTF-32.

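UTF-8 uses 1 byte to represent characters in the old ASCII set (U+0000 to U+007F), 2 bytes for characters from U+0080 to U+07FF, a range that covers several more alphabetic blocks such as Latin extensions, Greek, Cyrillic, Hebrew, and Arabic, and 3 bytes for the rest of the Basic Multilingual Plane (BMP). Supplementary characters, which lie beyond the BMP, use 4 bytes.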
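The UTF-8 length rule can be checked with a short Python sketch. The function `utf8_length_from_code_point` is a hypothetical helper written for this note; it predicts the byte count from the code point value alone, and the loop compares that prediction with what Python's built-in encoder produces for a few sample characters chosen for illustration.

```python
def utf8_length_from_code_point(cp: int) -> int:
    """Predict how many bytes UTF-8 needs for the code point cp."""
    if cp < 0x80:        # the old ASCII set
        return 1
    if cp < 0x800:       # Latin extensions, Greek, Cyrillic, Hebrew, Arabic, ...
        return 2
    if cp < 0x10000:     # the rest of the BMP
        return 3
    return 4             # supplementary characters


# Sample characters (hypothetical choices): A, e-acute, Hebrew alef,
# euro sign, and a supplementary character (musical G clef).
SAMPLES = ["\u0041", "\u00e9", "\u05d0", "\u20ac", "\U0001D11E"]

for ch in SAMPLES:
    encoded = ch.encode("utf-8")
    # The built-in encoder should agree with the range-based prediction.
    assert len(encoded) == utf8_length_from_code_point(ord(ch))
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s)")
```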

UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes for supplementary characters, which are represented as a pair of 2-byte units known as a surrogate pair.
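A minimal Python sketch of the UTF-16 behaviour follows. The helper `utf16_units` is a name invented for this note; it splits the encoded bytes into 16-bit units so that the difference between a BMP character (one unit) and a supplementary character (two units) is visible. The big-endian codec `utf-16-be` is used so that no byte order mark is added to the output.

```python
def utf16_units(ch: str) -> list:
    """Return the 16-bit code units UTF-16 uses for a single character."""
    # "utf-16-be" gives big-endian bytes without a byte order mark.
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# Sample characters (hypothetical choices): euro sign (BMP) and the
# musical G clef (supplementary).
for ch in ["\u20ac", "\U0001D11E"]:
    units = utf16_units(ch)
    shown = " ".join(f"{u:04X}" for u in units)
    print(f"U+{ord(ch):04X}: {len(units) * 2} bytes -> {shown}")
```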

UTF-32 uses 4 bytes everywhere.

In the chart on the slide, the first line of numbers represents the position of each character in the Unicode coded character set. The other lines show the byte values used to represent that character in a particular character encoding.
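A chart of this kind can be approximated in Python. The sketch below is not the slide's chart: the sample characters and the helper names `hex_bytes` and `encoding_rows` are assumptions made for illustration. For each character it lists the code point and then the byte values produced by Python's built-in UTF-8, UTF-16, and UTF-32 encoders, using the big-endian variants so that no byte order mark appears.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated two-digit hexadecimal values."""
    return " ".join(f"{b:02X}" for b in data)


def encoding_rows(ch: str) -> dict:
    """Return the code point and the byte values in each encoding."""
    return {
        "code point": f"U+{ord(ch):04X}",
        "UTF-8": hex_bytes(ch.encode("utf-8")),
        "UTF-16": hex_bytes(ch.encode("utf-16-be")),
        "UTF-32": hex_bytes(ch.encode("utf-32-be")),
    }


# Sample characters (hypothetical choices), written as escapes.
for ch in ["\u0041", "\u05d0", "\u20ac", "\U0001D11E"]:
    for label, value in encoding_rows(ch).items():
        print(f"{label:>10}: {value}")
    print()
```

Whatever the character, the UTF-32 row always contains 4 bytes, and those bytes are simply the code point value padded with leading zeros.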

This explanation glosses over some of the detailed nomenclature related to encoding. More detail can be found in Unicode Technical Report #17.


Copyright © 2003-2005 Richard Ishida. All rights reserved.