
Unicode provides a superset of most character sets in use around the world, but tries not to duplicate characters unnecessarily. For example, there are several ISO character sets in the 8859 range that all duplicate the ASCII characters. Unicode doesn't have as many codes for the letter 'a' as there are character sets - that would make for a huge and confusing character set.
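As a small illustration of the point about the letter 'a', the sketch below decodes the same byte, 0x61, under ASCII and several ISO 8859 character sets and checks that each one maps to the single Unicode code point U+0061. The particular list of encodings and the variable names are choices made for this example, not something taken from the slides.

```python
# Decode the byte 0x61 ('a') under several legacy character sets and
# record the Unicode code point each one maps to.
# The encoding list is an illustrative selection, not an exhaustive one.
encodings = ["ascii", "iso-8859-1", "iso-8859-2", "iso-8859-5",
             "iso-8859-7", "iso-8859-15"]

code_points = {enc: ord(b"\x61".decode(enc)) for enc in encodings}
distinct = set(code_points.values())

for enc, cp in code_points.items():
    print(f"{enc:12} byte 0x61 -> U+{cp:04X}")

print("distinct Unicode code points:", len(distinct))
```

However many source character sets contain the letter, the set of resulting code points has only one member.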
The same principle applies to Han (Chinese) characters. Laid end to end, the initial set of sources for Han encoding in Unicode comprised 121,000 characters, but there were many repeats, and after duplicates were eliminated the final Unicode tally was 20,902. (There are now over 70,000 Han characters encoded in Unicode.)
Han characters with different meanings or etymologies were not unified. Han characters, however, are highly pictorial in nature, so the (dis)unification process had to take visual form into account to some extent. Where there was a significant visual difference between Han characters that represented the same thing, they were allotted separate Unicode code points. (Unifying the Han characters is a sophisticated process, carried out over a long period by many East Asian experts.)
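The figure of 20,902 can be checked against the code space itself. The sketch below assumes that the initial unified set corresponds to the range U+4E00 to U+9FA5 in the CJK Unified Ideographs block; those endpoints are not given in the text above and are supplied here for illustration. It counts the code points in that range and confirms that Python's `unicodedata` module names each one as a unified ideograph.

```python
import unicodedata

# Assumed endpoints of the original unified Han range (not stated in the text).
first, last = 0x4E00, 0x9FA5

# Number of code points in the inclusive range.
count = last - first + 1

# Every code point in the range should carry an algorithmic name of the
# form "CJK UNIFIED IDEOGRAPH-XXXX".
names_ok = all(
    unicodedata.name(chr(cp)) == f"CJK UNIFIED IDEOGRAPH-{cp:04X}"
    for cp in range(first, last + 1)
)

print(count, names_ok)
```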
Factors such as those shown on this slide prevent unification, ie.
Copyright © 2003-2005 Richard Ishida. All rights reserved.