
You can see from the previous slide that nearly all Unicode characters are encoded using multiple bytes. In an encoding such as UTF-8 the number of bytes used depends on the character in question: an ASCII character such as 'a' is represented by a single byte, while other characters take two, three or four bytes. This means that care must be taken to recognize and respect character boundaries. Applications cannot simply handle a fixed number of bytes when performing editing operations such as inserting, deleting, wrapping and cursor positioning. Collation for searching and sorting, pointing into strings, and all other text operations similarly need to work out where the character boundaries lie in order to process the text successfully.
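
To make the point about boundaries concrete, here is a minimal Python sketch. It assumes the text is well-formed UTF-8 and treats each Unicode code point as one "character" (user-perceived characters made of several code points would need extra handling). The function names utf8_char_boundaries and truncate_utf8 are hypothetical helpers written for this illustration, not part of any library.

```python
def utf8_char_boundaries(data: bytes) -> list:
    """Return the byte offsets at which each character starts.

    In UTF-8, continuation bytes always have the bit pattern 10xxxxxx,
    so any byte that does not match that pattern begins a new character.
    """
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut data to at most max_bytes without splitting a character.

    Assumes max_bytes is non-negative and data is well-formed UTF-8.
    """
    if max_bytes >= len(data):
        return data
    cut = max_bytes
    # Back up while the byte at the cut point is a continuation byte,
    # i.e. while the cut would land in the middle of a character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


# 'a', e-acute, euro sign, and a Gothic letter outside the BMP
text = "a\u00e9\u20ac\U00010348"
encoded = text.encode("utf-8")

byte_lengths = [len(ch.encode("utf-8")) for ch in text]
boundaries = utf8_char_boundaries(encoded)

print(byte_lengths)               # [1, 2, 3, 4]
print(boundaries)                 # [0, 1, 3, 6]
print(truncate_utf8(encoded, 2))  # b'a'
```

Cutting the byte string naively at offset 2 would leave half of the two-byte character behind; stepping back to the nearest boundary keeps the text valid.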
Copyright © 2003-2005 Richard Ishida. All rights reserved.